如何使 statsmodels 的 ANOVA 结果与 R 的 ANOVA 结果匹配

问题描述 投票:0回答:2

以下问题是对 StatsExchange 上的帖子的具体改编。

以下 R 脚本运行良好:

library(reshape)

J1 <- c(9,6,8,7,10,6)
J2 <- c(2,1,4,1,5,2)
J3 <- c(5,3,6,2,6,4)
J4 <- c(8,2,8,6,9,7)
Subject <- c('S1', 'S2', 'S3', 'S4', 'S5', 'S6')
sf <- data.frame(Subject, J1, J2, J3, J4)
sf.df <- melt(sf, varnames=c("Subject", "Rater"))
anova(lm(value ~ Subject, sf.df))
anova(lm(value ~ Subject*variable, sf.df))

但是,当我尝试将其转换为 Python 时,我遇到了 scipy 抛出的异常。

这是等效的 Python 脚本:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create the data
J1 = [9, 6, 8, 7, 10, 6]
J2 = [2, 1, 4, 1, 5, 2]
J3 = [5, 3, 6, 2, 6, 4]
J4 = [8, 2, 8, 6, 9, 7]
Subject = ['S1', 'S2', 'S3', 'S4', 'S5', 'S6']

# Create the DataFrame
sf = pd.DataFrame({'Subject': Subject, 'J1': J1, 'J2': J2, 'J3': J3, 'J4': J4})
print(sf)

# Melt the DataFrame
sf_melted = pd.melt(sf, id_vars=['Subject'], var_name='Rater', value_name='Value')
print(sf_melted)

# Perform ANOVA
model1 = ols('Value ~ Subject', data=sf_melted).fit()
anova_table1 = sm.stats.anova_lm(model1, typ=2)
print(anova_table1)

model2 = ols('Value ~ Subject * Rater', data=sf_melted).fit()
anova_table2 = sm.stats.anova_lm(model2, typ=2)
print(anova_table2)

第一个模型的方差分析计算运行得很好,但第二个模型似乎存在问题。

这是完整的堆栈跟踪:

/home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/statsmodels/regression/linear_model.py:1717: RuntimeWarning: divide by zero encountered in double_scalars return np.dot(wresid, wresid) / self.df_resid --------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[246], line 26 23 print(anova_table1) 25 model2 = ols('Value ~ Subject * Rater', data=sf_melted).fit() ---> 26 anova_table2 = sm.stats.anova_lm(model2, typ=2) 27 print(anova_table2) File ~/anaconda3/envs/python3/lib/python3.10/site-packages/statsmodels/stats/anova.py:353, in anova_lm(*args, **kwargs) 351 if len(args) == 1: 352 model = args[0] --> 353 return anova_single(model, **kwargs) 355 if typ not in [1, "I"]: 356 raise ValueError("Multiple models only supported for type I. " 357 "Got type %s" % str(typ)) File ~/anaconda3/envs/python3/lib/python3.10/site-packages/statsmodels/stats/anova.py:84, in anova_single(model, **kwargs) 81 return anova1_lm_single(model, endog, exog, nobs, design_info, table, 82 n_rows, test, pr_test, robust) 83 elif typ in [2, "II"]: ---> 84 return anova2_lm_single(model, design_info, n_rows, test, pr_test, 85 robust) 86 elif typ in [3, "III"]: 87 return anova3_lm_single(model, design_info, n_rows, test, pr_test, 88 robust) File ~/anaconda3/envs/python3/lib/python3.10/site-packages/statsmodels/stats/anova.py:207, in anova2_lm_single(model, design_info, n_rows, test, pr_test, robust) 205 LVL = np.dot(np.dot(L1,robust_cov),L2.T) 206 from scipy import linalg --> 207 orth_compl,_ = linalg.qr(LVL) 208 r = L1.shape[0] - L2.shape[0] 209 # L1|2 210 # use the non-unique orthogonal completion since L12 is rank r File ~/anaconda3/envs/python3/lib/python3.10/site-packages/scipy/linalg/_decomp_qr.py:129, in qr(a, overwrite_a, lwork, mode, pivoting, check_finite) 125 raise ValueError("Mode argument should be one of ['full', 'r'," 126 "'economic', 'raw']") 128 if check_finite: --> 129 a1 = numpy.asarray_chkfinite(a) 130 else: 131 a1 = numpy.asarray(a) File ~/anaconda3/envs/python3/lib/python3.10/site-packages/numpy/lib/function_base.py:603, in asarray_chkfinite(a, dtype, order) 601 a = asarray(a, dtype=dtype, order=order) 602 if a.dtype.char in typecodes['AllFloat'] and not np.isfinite(a).all(): --> 603 raise ValueError( 604 "array must not contain infs or NaNs") 605 return a ValueError: array must not contain infs or NaNs
我尝试过的事情:

    更改库版本。我最初使用 numpy 1.22.4 和 scipy 1.12,现在使用 numpy 2.0.0 和 scipy 1.13.1。
  1. 我尝试通过为第二个模型编写
  2. Value ~ Subject * C(Rater)
     来在线性模型中强制执行分类变量。我想知道它是否没有将数据视为分类数据......但从我读到的 statsmodels 应该正确推断任何字符串变量都是分类变量,所以这可能是错误的想法。
任何 R/Python/statsmodels 专家都可以帮我解决这个问题吗?

预期输出(来自 R 脚本)如下:

Analysis of Variance Table Response: value Df Sum Sq Mean Sq F value Pr(>F) Subject 5 56.208 11.2417 1.7947 0.1648 Residuals 18 112.750 6.2639 Analysis of Variance Table Response: value Df Sum Sq Mean Sq F value Pr(>F) Subject 5 56.208 11.242 Rater 3 97.458 32.486 Subject:Rater 15 15.292 1.019 Residuals 0 0.000
^ 请注意,此预期输出也与引用的 StatsExchange 帖子中显示的输出相对应(“Rater”变量的 MeanSq 值 32.486 大致等于 StatsExchange 帖子中与“Judge”变量关联的值 32.49)。

python r dataframe statsmodels anova
2个回答
0
投票
您的

Rater

也许应该转换为整数?

我不确定您的预期输出是什么。

尝试:

sf_melted['Rater'] = sf_melted['Rater'].str.slice(1).astype(int) model2 = ols('Value ~ Subject * Rater', data=sf_melted).fit() anova_table2 = sm.stats.anova_lm(model2, typ=2, missing=0) print(anova_table2)
输出:

sum_sq df F PR(>F) Subject 56.208333 5.0 1.273843 0.336760 Rater 0.408333 1.0 0.046270 0.833298 Subject:Rater 6.441667 5.0 0.145987 0.977485 Residual 105.900000 12.0 NaN NaN
    

0
投票

anova()

 生成 I 型方差分析表(参见例如 
here),因此要获得相同的输出,必须将 typ=1
 传递给 
anova_lm
。您可以验证这样做时,statsmodels 的输出与下面的 R 相同。
1

对于第一个型号:

sm.stats.anova_lm(model1, typ=1) df sum_sq mean_sq F PR(>F) Subject 5.0 56.208333 11.241667 1.794678 0.164769 Residual 18.0 112.750000 6.263889 NaN NaN
对于第二个型号:

sm.stats.anova_lm(model2, typ=1) df sum_sq mean_sq F PR(>F) Subject 5.0 5.620833e+01 11.241667 0.0 NaN Rater 3.0 9.745833e+01 32.486111 0.0 NaN Subject:Rater 15.0 1.529167e+01 1.019444 0.0 NaN Residual 0.0 1.778203e-25 inf NaN NaN
顺便说一句,如果我们尝试在 R 中为 

model2

 生成类型 2 方差分析表,我们也会收到错误:

> car::Anova(lm(value ~ Subject*variable, sf.df), type=2) Error in Anova.lm(lm(value ~ Subject * variable, sf.df), type = 2) : residual df = 0

1 对于 model2

mean_sq
 对于 statsmodels 的 
inf
 来说是 
anova_lm()
,但对于 R 的 
NaN
 来说是 
anova()
。这是由于两个函数中 
sum_sq
 的舍入误差造成的。对于 R 的 
anova()
,Residual 的 
sum_sq
 为 0,而对于 statsmodels 的 
anova_lm()
,则为 1.778203e-25。

© www.soinside.com 2019 - 2024. All rights reserved.