以下问题是对 StatsExchange 上的帖子的具体改编。
以下 R 脚本运行良好:
library(reshape)
J1 <- c(9,6,8,7,10,6)
J2 <- c(2,1,4,1,5,2)
J3 <- c(5,3,6,2,6,4)
J4 <- c(8,2,8,6,9,7)
Subject <- c('S1', 'S2', 'S3', 'S4', 'S5', 'S6')
sf <- data.frame(Subject, J1, J2, J3, J4)
sf.df <- melt(sf, varnames=c("Subject", "Rater"))
anova(lm(value ~ Subject, sf.df))
anova(lm(value ~ Subject*variable, sf.df))
但是,当我尝试将其转换为 Python 时,我遇到了 scipy 抛出的异常。
这是等效的 Python 脚本:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Create the data
J1 = [9, 6, 8, 7, 10, 6]
J2 = [2, 1, 4, 1, 5, 2]
J3 = [5, 3, 6, 2, 6, 4]
J4 = [8, 2, 8, 6, 9, 7]
Subject = ['S1', 'S2', 'S3', 'S4', 'S5', 'S6']
# Create the DataFrame
sf = pd.DataFrame({'Subject': Subject, 'J1': J1, 'J2': J2, 'J3': J3, 'J4': J4})
print(sf)
# Melt the DataFrame
sf_melted = pd.melt(sf, id_vars=['Subject'], var_name='Rater', value_name='Value')
print(sf_melted)
# Perform ANOVA
model1 = ols('Value ~ Subject', data=sf_melted).fit()
anova_table1 = sm.stats.anova_lm(model1, typ=2)
print(anova_table1)
model2 = ols('Value ~ Subject * Rater', data=sf_melted).fit()
anova_table2 = sm.stats.anova_lm(model2, typ=2)
print(anova_table2)
第一个模型的方差分析计算运行得很好,但第二个模型似乎存在问题。
这是完整的堆栈跟踪:
/home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/statsmodels/regression/linear_model.py:1717: RuntimeWarning: divide by zero encountered in double_scalars
return np.dot(wresid, wresid) / self.df_resid
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[246], line 26
23 print(anova_table1)
25 model2 = ols('Value ~ Subject * Rater', data=sf_melted).fit()
---> 26 anova_table2 = sm.stats.anova_lm(model2, typ=2)
27 print(anova_table2)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/statsmodels/stats/anova.py:353, in anova_lm(*args, **kwargs)
351 if len(args) == 1:
352 model = args[0]
--> 353 return anova_single(model, **kwargs)
355 if typ not in [1, "I"]:
356 raise ValueError("Multiple models only supported for type I. "
357 "Got type %s" % str(typ))
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/statsmodels/stats/anova.py:84, in anova_single(model, **kwargs)
81 return anova1_lm_single(model, endog, exog, nobs, design_info, table,
82 n_rows, test, pr_test, robust)
83 elif typ in [2, "II"]:
---> 84 return anova2_lm_single(model, design_info, n_rows, test, pr_test,
85 robust)
86 elif typ in [3, "III"]:
87 return anova3_lm_single(model, design_info, n_rows, test, pr_test,
88 robust)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/statsmodels/stats/anova.py:207, in anova2_lm_single(model, design_info, n_rows, test, pr_test, robust)
205 LVL = np.dot(np.dot(L1,robust_cov),L2.T)
206 from scipy import linalg
--> 207 orth_compl,_ = linalg.qr(LVL)
208 r = L1.shape[0] - L2.shape[0]
209 # L1|2
210 # use the non-unique orthogonal completion since L12 is rank r
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/scipy/linalg/_decomp_qr.py:129, in qr(a, overwrite_a, lwork, mode, pivoting, check_finite)
125 raise ValueError("Mode argument should be one of ['full', 'r',"
126 "'economic', 'raw']")
128 if check_finite:
--> 129 a1 = numpy.asarray_chkfinite(a)
130 else:
131 a1 = numpy.asarray(a)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/numpy/lib/function_base.py:603, in asarray_chkfinite(a, dtype, order)
601 a = asarray(a, dtype=dtype, order=order)
602 if a.dtype.char in typecodes['AllFloat'] and not np.isfinite(a).all():
--> 603 raise ValueError(
604 "array must not contain infs or NaNs")
605 return a
ValueError: array must not contain infs or NaNs
我尝试过的事情:
Value ~ Subject * C(Rater)
来在线性模型中强制执行分类变量。我想知道它是否没有将数据视为分类数据......但从我读到的 statsmodels 应该正确推断任何字符串变量都是分类变量,所以这可能是错误的想法。
预期输出(来自 R 脚本)如下:
Analysis of Variance Table
Response: value
Df Sum Sq Mean Sq F value Pr(>F)
Subject 5 56.208 11.2417 1.7947 0.1648
Residuals 18 112.750 6.2639
Analysis of Variance Table
Response: value
Df Sum Sq Mean Sq F value Pr(>F)
Subject 5 56.208 11.242
Rater 3 97.458 32.486
Subject:Rater 15 15.292 1.019
Residuals 0 0.000
^ 请注意,此预期输出也与引用的 StatsExchange 帖子中显示的输出相对应(“Rater”变量的 MeanSq 值 32.486 大致等于 StatsExchange 帖子中与“Judge”变量关联的值 32.49)。
Rater
也许应该转换为整数?我不确定您的预期输出是什么。
尝试:
sf_melted['Rater'] = sf_melted['Rater'].str.slice(1).astype(int)
model2 = ols('Value ~ Subject * Rater', data=sf_melted).fit()
anova_table2 = sm.stats.anova_lm(model2, typ=2, missing=0)
print(anova_table2)
输出:
sum_sq df F PR(>F)
Subject 56.208333 5.0 1.273843 0.336760
Rater 0.408333 1.0 0.046270 0.833298
Subject:Rater 6.441667 5.0 0.145987 0.977485
Residual 105.900000 12.0 NaN NaN
anova()
生成 I 型方差分析表(参见例如here),因此要获得相同的输出,必须将
typ=1
传递给
anova_lm
。您可以验证这样做时,statsmodels 的输出与下面的 R 相同。1 对于第一个型号:
sm.stats.anova_lm(model1, typ=1)
df sum_sq mean_sq F PR(>F)
Subject 5.0 56.208333 11.241667 1.794678 0.164769
Residual 18.0 112.750000 6.263889 NaN NaN
对于第二个型号:
sm.stats.anova_lm(model2, typ=1)
df sum_sq mean_sq F PR(>F)
Subject 5.0 5.620833e+01 11.241667 0.0 NaN
Rater 3.0 9.745833e+01 32.486111 0.0 NaN
Subject:Rater 15.0 1.529167e+01 1.019444 0.0 NaN
Residual 0.0 1.778203e-25 inf NaN NaN
顺便说一句,如果我们尝试在 R 中为 model2
生成类型 2 方差分析表,我们也会收到错误:
> car::Anova(lm(value ~ Subject*variable, sf.df), type=2)
Error in Anova.lm(lm(value ~ Subject * variable, sf.df), type = 2) :
residual df = 0
1 对于 model2
,
mean_sq
对于 statsmodels 的
inf
来说是
anova_lm()
,但对于 R 的
NaN
来说是
anova()
。这是由于两个函数中
sum_sq
的舍入误差造成的。对于 R 的
anova()
,Residual 的
sum_sq
为 0,而对于 statsmodels 的
anova_lm()
,则为 1.778203e-25。