如何使 statsmodels 的 ANOVA 结果与 R 的 ANOVA 结果匹配

Question

以下问题是对 StatsExchange 上的帖子的具体改编。

以下 R 脚本运行良好：

library(reshape)

J1 <- c(9,6,8,7,10,6)
J2 <- c(2,1,4,1,5,2)
J3 <- c(5,3,6,2,6,4)
J4 <- c(8,2,8,6,9,7)
Subject <- c('S1', 'S2', 'S3', 'S4', 'S5', 'S6')
sf <- data.frame(Subject, J1, J2, J3, J4)
sf.df <- melt(sf, varnames=c("Subject", "Rater"))
anova(lm(value ~ Subject, sf.df))
anova(lm(value ~ Subject*variable, sf.df))

但是，当我尝试将其转换为 Python 时，我遇到了 scipy 抛出的异常。

这是等效的 Python 脚本：

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create the data
J1 = [9, 6, 8, 7, 10, 6]
J2 = [2, 1, 4, 1, 5, 2]
J3 = [5, 3, 6, 2, 6, 4]
J4 = [8, 2, 8, 6, 9, 7]
Subject = ['S1', 'S2', 'S3', 'S4', 'S5', 'S6']

# Create the DataFrame
sf = pd.DataFrame({'Subject': Subject, 'J1': J1, 'J2': J2, 'J3': J3, 'J4': J4})
print(sf)

# Melt the DataFrame
sf_melted = pd.melt(sf, id_vars=['Subject'], var_name='Rater', value_name='Value')
print(sf_melted)

# Perform ANOVA
model1 = ols('Value ~ Subject', data=sf_melted).fit()
anova_table1 = sm.stats.anova_lm(model1, typ=2)
print(anova_table1)

model2 = ols('Value ~ Subject * Rater', data=sf_melted).fit()
anova_table2 = sm.stats.anova_lm(model2, typ=2)
print(anova_table2)

第一个模型的方差分析计算运行得很好，但第二个模型似乎存在问题。

这是完整的堆栈跟踪：

/home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/statsmodels/regression/linear_model.py:1717: RuntimeWarning: divide by zero encountered in double_scalars
  return np.dot(wresid, wresid) / self.df_resid

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[246], line 26
     23 print(anova_table1)
     25 model2 = ols('Value ~ Subject * Rater', data=sf_melted).fit()
---> 26 anova_table2 = sm.stats.anova_lm(model2, typ=2)
     27 print(anova_table2)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/statsmodels/stats/anova.py:353, in anova_lm(*args, **kwargs)
    351 if len(args) == 1:
    352     model = args[0]
--> 353     return anova_single(model, **kwargs)
    355 if typ not in [1, "I"]:
    356     raise ValueError("Multiple models only supported for type I. "
    357                      "Got type %s" % str(typ))

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/statsmodels/stats/anova.py:84, in anova_single(model, **kwargs)
     81     return anova1_lm_single(model, endog, exog, nobs, design_info, table,
     82                             n_rows, test, pr_test, robust)
     83 elif typ in [2, "II"]:
---> 84     return anova2_lm_single(model, design_info, n_rows, test, pr_test,
     85                             robust)
     86 elif typ in [3, "III"]:
     87     return anova3_lm_single(model, design_info, n_rows, test, pr_test,
     88                             robust)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/statsmodels/stats/anova.py:207, in anova2_lm_single(model, design_info, n_rows, test, pr_test, robust)
    205 LVL = np.dot(np.dot(L1,robust_cov),L2.T)
    206 from scipy import linalg
--> 207 orth_compl,_ = linalg.qr(LVL)
    208 r = L1.shape[0] - L2.shape[0]
    209 # L1|2
    210 # use the non-unique orthogonal completion since L12 is rank r

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/scipy/linalg/_decomp_qr.py:129, in qr(a, overwrite_a, lwork, mode, pivoting, check_finite)
    125     raise ValueError("Mode argument should be one of ['full', 'r',"
    126                      "'economic', 'raw']")
    128 if check_finite:
--> 129     a1 = numpy.asarray_chkfinite(a)
    130 else:
    131     a1 = numpy.asarray(a)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/numpy/lib/function_base.py:603, in asarray_chkfinite(a, dtype, order)
    601 a = asarray(a, dtype=dtype, order=order)
    602 if a.dtype.char in typecodes['AllFloat'] and not np.isfinite(a).all():
--> 603     raise ValueError(
    604         "array must not contain infs or NaNs")
    605 return a

ValueError: array must not contain infs or NaNs

我尝试过的事情：

Value ~ Subject * C(Rater)

 来在线性模型中强制执行分类变量。我想知道它是否没有将数据视为分类数据......但从我读到的 statsmodels 应该正确推断任何字符串变量都是分类变量，所以这可能是错误的想法。

任何 R/Python/statsmodels 专家都可以帮我解决这个问题吗？

预期输出（来自 R 脚本）如下：

Analysis of Variance Table

Response: value
          Df  Sum Sq Mean Sq F value Pr(>F)
Subject    5  56.208 11.2417  1.7947 0.1648
Residuals 18 112.750  6.2639               
Analysis of Variance Table

Response: value
                 Df Sum Sq Mean Sq F value Pr(>F)
Subject           5 56.208  11.242               
Rater             3 97.458  32.486               
Subject:Rater    15 15.292   1.019               
Residuals         0  0.000

^ 请注意，此预期输出也与引用的 StatsExchange 帖子中显示的输出相对应（“Rater”变量的 MeanSq 值 32.486 大致等于 StatsExchange 帖子中与“Judge”变量关联的值 32.49）。

Answer 1

您的

Rater

也许应该转换为整数？

我不确定您的预期输出是什么。

尝试：

sf_melted['Rater'] = sf_melted['Rater'].str.slice(1).astype(int)
model2 = ols('Value ~ Subject * Rater', data=sf_melted).fit()
anova_table2 = sm.stats.anova_lm(model2, typ=2, missing=0)
print(anova_table2)

输出：

                   sum_sq    df         F    PR(>F)
Subject         56.208333   5.0  1.273843  0.336760
Rater            0.408333   1.0  0.046270  0.833298
Subject:Rater    6.441667   5.0  0.145987  0.977485
Residual       105.900000  12.0       NaN       NaN

Answer 2

anova()

 生成 I 型方差分析表（参见例如

here），因此要获得相同的输出，必须将 typ=1

 传递给

anova_lm

。您可以验证这样做时，statsmodels 的输出与下面的 R 相同。

1

对于第一个型号：

sm.stats.anova_lm(model1, typ=1)

            df      sum_sq    mean_sq         F    PR(>F)
Subject    5.0   56.208333  11.241667  1.794678  0.164769
Residual  18.0  112.750000   6.263889       NaN       NaN

对于第二个型号：

sm.stats.anova_lm(model2, typ=1)

                 df        sum_sq    mean_sq    F  PR(>F)
Subject         5.0  5.620833e+01  11.241667  0.0     NaN
Rater           3.0  9.745833e+01  32.486111  0.0     NaN
Subject:Rater  15.0  1.529167e+01   1.019444  0.0     NaN
Residual        0.0  1.778203e-25        inf  NaN     NaN

顺便说一句，如果我们尝试在 R 中为

model2

 生成类型 2 方差分析表，我们也会收到错误：

> car::Anova(lm(value ~ Subject*variable, sf.df), type=2)
Error in Anova.lm(lm(value ~ Subject * variable, sf.df), type = 2) : 
  residual df = 0

1^对于model2

，

mean_sq

 对于 statsmodels 的

inf

 来说是

anova_lm()

，但对于 R 的

NaN

 来说是

anova()

。这是由于两个函数中

sum_sq

 的舍入误差造成的。对于 R 的

anova()

，Residual 的

sum_sq

 为 0，而对于 statsmodels 的

anova_lm()

，则为 1.778203e-25。

如何使 statsmodels 的 ANOVA 结果与 R 的 ANOVA 结果匹配

问题描述投票：0回答：2

2个回答

最新问题

如何使 statsmodels 的 ANOVA 结果与 R 的 ANOVA 结果匹配

问题描述 投票：0回答：2

2个回答

最新问题

问题描述投票：0回答：2