我使用 Python
rpy2
在 R中运行
glm()
(回归模型)。我的问题是得到一个我可以处理的结果。最好的是一个不错的数据结构(字典、数据框……),但我也对人类可读的多行字符串感到满意。
直接在 R 中使用
glm()
会产生类似的输出,这就是我所期望的:
Call:
glm(formula = C ~ B + A, family = poisson(), data = df, offset = D)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.391e+01 3.171e-03 -7540.871 <2e-16 ***
B -2.372e-04 1.331e-04 -1.782 0.0748 .
A 1.285e-03 1.345e-04 9.552 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 17500269 on 49999 degrees of freedom
Residual deviance: 17500174 on 49997 degrees of freedom
AIC: Inf
Number of Fisher Scoring iterations: 16
在下面的示例代码中,我尝试了两种提取结果的变体。
在 变体 A 中,我确实在 R 端使用
capture.output()
将控制台输出提取为字符串。但结果看起来很奇怪(请参阅代码中的文档字符串)。在变体 B 中,我通过 summary()
使用 rpy2
函数。这里我得到了一个我无法使用的有线数据结构。
#!/usr/bin/env python3
import random
import pandas as pd
import rpy2
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.pandas2ri as pandas2ri
pandas2ri.activate()
random.seed(0)
k = 10000
df = pd.DataFrame({
'A': random.choices(range(100), k=k),
'B': random.choices([1, 2, 3], k=k),
'C': random.choices([0, 1], k=k),
'D': random.choices(range(20, 30), k=k),
})
glm = robjects.r['glm']
reg = glm(
formula='C ~ B',
data=robjects.conversion.py2rpy(df),
family=robjects.r['poisson'](),
offset=df['D']
)
robjects.r('''get_result <- function(reg) {
result <- paste(capture.output(reg), collapse="\n")
return(result)
}
''')
get_result = robjects.globalenv['get_result']
result_variant_A = get_result(reg)
"""
[...SNIPPED...]
L, `9948` = 23L, `9949` = 23L, `9950` = 22L, `9951` = 22L, \n`9952` = 29L, `9953` = 27L, `9954` = 26L, `9955` = 23L, `9956` = 28L, \n`9957` = 23L, `9958` = 22L, `9959` = 26L, `9960` = 21L, `9961` = 24L, \n`9962` = 28L, `9963` = 22L, `9964` = 24L, `9965` = 26L, `9966` = 29L, \n`9967` = 20L, `9968` = 25L, `9969` = 22L, `9970` = 20L, `9971` = 22L, \n`9972` = 28L, `9973` = 29L, `9974` = 24L, `9975` = 21L, `9976` = 28L, \n`9977` = 22L, `9978` = 27L, `9979` = 26L, `9980` = 23L, `9981` = 22L, \n`9982` = 22L, `9983` = 21L, `9984` = 22L, `9985` = 28L, `9986` = 23L, \n`9987` = 23L, `9988` = 21L, `9989` = 28L, `9990` = 28L, `9991` = 20L, \n`9992` = 20L, `9993` = 20L, `9994` = 24L, `9995` = 22L, `9996` = 24L, \n`9997` = 20L, `9998` = 20L, `9999` = 22L))\n\nCoefficients:\n(Intercept) B \n -27.79929 -0.01377 \n\nDegrees of Freedom: 9999 Total (i.e. Null); 9998 Residual\nNull Deviance:\t 33360 \nResidual Deviance: 33360 \tAIC: 43370'],
dtype='<U419704')
>>> type(result)
<class 'numpy.ndarray'>
"""
result_variant_B = robjects.r['summary'](reg)
"""
<rpy2.robjects.vectors.ListVector object at 0x000002C9B71F6A10> [19]
R classes: ('summary.glm',)
[LangSexpV..., LangSexpV..., ListSexpV..., FloatSexp..., ..., FloatSexp..., IntSexpVe..., FloatSexp..., FloatSexp...]
call: <class 'rpy2.robjects.language.LangVector'>
Rlang( (function (formula, family = gaussian, data, weights, subset, )
terms: <class 'rpy2.robjects.Formula'>
<rpy2.robjects.Formula object at 0x000002C9B2630890> [6]
R classes: ('terms', 'formula')
<rpy2.robjects.vectors.ListVector object at 0x000002C9B71F6A10> [19]
R classes: ('summary.glm',)
[LangSexpV..., LangSexpV..., ListSexpV..., FloatSexp..., ..., FloatSexp..., IntSexpVe..., FloatSexp..., FloatSexp...]
deviance: <class 'numpy.ndarray'>
array([33357.71065632])
...
contrasts: <class 'numpy.ndarray'>
array([1.])
df.residual: <class 'rpy2.robjects.vectors.IntVector'>
<rpy2.robjects.vectors.IntVector object at 0x000002C9B2630890> [13]
R classes: ('integer',)
[2, 9998, 2]
null.deviance: <class 'numpy.ndarray'>
array([[ 0.00138301, -0.00059514],
[-0.00059514, 0.00029936]])
df.null: <class 'numpy.ndarray'>
array([[ 0.00138301, -0.00059514],
[-0.00059514, 0.00029936]])
"""
您可以使用
rx
或 rx2
访问 R 对象中的元素。例如,要访问 GLM 模型的系数,您可以使用以下命令:
reg.rx2('coefficients')
总体来说:
有关更多详细信息,请参阅rpy2有关线性模型的文档。
或者,您可以使用 R 中的 broom::tidy 包将模型结果转换为整洁的 R 数据帧,然后将其转换为 pandas DataFrame,以便在 Python 中更轻松地访问。方法如下:
# Convert model results using broom::tidy
tidy = robjects.r('broom::tidy')
tidy_reg = tidy(reg)
# Convert the R dataframe to a pandas dataframe
glm_result = pandas2ri.rpy2py(tidy_reg)
# Accessing elements in the pandas DataFrame
print(glm_result)
print(glm_result.at['2', 'estimate']) # Note: Index values are strings.
生成的 pandas DataFrame 将具有基于字符串的索引。如果需要,您可以重置索引并使用 Python 的标准索引方法。 以下是完整代码:
import random
import pandas as pd
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
pandas2ri.activate()
random.seed(0)
k = 10000
df = pd.DataFrame({
'A': random.choices(range(100), k=k),
'B': random.choices([1, 2, 3], k=k),
'C': random.choices([0, 1], k=k),
'D': random.choices(range(20, 30), k=k),
})
glm = robjects.r['glm']
reg = glm(
formula='C ~ B',
data=robjects.conversion.py2rpy(df),
family=robjects.r['poisson'](),
offset=df['D'])
print(robjects.r("summary")(reg))
print(reg.rx2('coefficients'))
tidy = robjects.r('broom::tidy')
tidy_reg = tidy(reg)
glm_result = pandas2ri.rpy2py(tidy_reg)
print(glm_result)
print(glm_result.at['2','estimate'])