如何使用Python rpy2从R对象中提取glm()结果?

问题描述 投票:0回答:1

我使用 Python

rpy2
R
中运行 glm()(回归模型)。我的问题是得到一个我可以处理的结果。最好的是一个不错的数据结构(字典、数据框……),但我也对人类可读的多行字符串感到满意。

直接在 R 中使用

glm()
会产生类似的输出,这就是我所期望的:

Call:
glm(formula = C ~ B + A, family = poisson(), data = df, offset = D)

Coefficients:
              Estimate Std. Error   z value Pr(>|z|)
(Intercept) -2.391e+01  3.171e-03 -7540.871   <2e-16 ***
B           -2.372e-04  1.331e-04    -1.782   0.0748 .
A            1.285e-03  1.345e-04     9.552   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 17500269  on 49999  degrees of freedom
Residual deviance: 17500174  on 49997  degrees of freedom
AIC: Inf

Number of Fisher Scoring iterations: 16

在下面的示例代码中,我尝试了两种提取结果的变体。

变体 A 中,我确实在 R 端使用

capture.output()
将控制台输出提取为字符串。但结果看起来很奇怪(请参阅代码中的文档字符串)。在变体 B 中,我通过
summary()
使用
rpy2
函数。这里我得到了一个我无法使用的有线数据结构。

#!/usr/bin/env python3
import random
import pandas as pd
import rpy2
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.pandas2ri as pandas2ri
pandas2ri.activate()

random.seed(0)
k = 10000

df = pd.DataFrame({
    'A': random.choices(range(100), k=k),
    'B': random.choices([1, 2, 3], k=k),
    'C': random.choices([0, 1], k=k),
    'D': random.choices(range(20, 30), k=k),
})

glm = robjects.r['glm']

reg = glm(
    formula='C ~ B',
    data=robjects.conversion.py2rpy(df),
    family=robjects.r['poisson'](),
    offset=df['D']
)    

robjects.r('''get_result <- function(reg) {
            result <- paste(capture.output(reg), collapse="\n")
            return(result)
        }
''')
get_result = robjects.globalenv['get_result']

result_variant_A = get_result(reg)
"""
[...SNIPPED...]
L, `9948` = 23L, `9949` = 23L, `9950` = 22L, `9951` = 22L, \n`9952` = 29L, `9953` = 27L, `9954` = 26L, `9955` = 23L, `9956` = 28L, \n`9957` = 23L, `9958` = 22L, `9959` = 26L, `9960` = 21L, `9961` = 24L, \n`9962` = 28L, `9963` = 22L, `9964` = 24L, `9965` = 26L, `9966` = 29L, \n`9967` = 20L, `9968` = 25L, `9969` = 22L, `9970` = 20L, `9971` = 22L, \n`9972` = 28L, `9973` = 29L, `9974` = 24L, `9975` = 21L, `9976` = 28L, \n`9977` = 22L, `9978` = 27L, `9979` = 26L, `9980` = 23L, `9981` = 22L, \n`9982` = 22L, `9983` = 21L, `9984` = 22L, `9985` = 28L, `9986` = 23L, \n`9987` = 23L, `9988` = 21L, `9989` = 28L, `9990` = 28L, `9991` = 20L, \n`9992` = 20L, `9993` = 20L, `9994` = 24L, `9995` = 22L, `9996` = 24L, \n`9997` = 20L, `9998` = 20L, `9999` = 22L))\n\nCoefficients:\n(Intercept)            B  \n  -27.79929     -0.01377  \n\nDegrees of Freedom: 9999 Total (i.e. Null);  9998 Residual\nNull Deviance:\t    33360 \nResidual Deviance: 33360 \tAIC: 43370'],
      dtype='<U419704')
>>> type(result)
<class 'numpy.ndarray'>
"""

result_variant_B = robjects.r['summary'](reg)
"""
<rpy2.robjects.vectors.ListVector object at 0x000002C9B71F6A10> [19]
R classes: ('summary.glm',)
[LangSexpV..., LangSexpV..., ListSexpV..., FloatSexp..., ..., FloatSexp..., IntSexpVe..., FloatSexp..., FloatSexp...]
  call: <class 'rpy2.robjects.language.LangVector'>
  Rlang( (function (formula, family = gaussian, data, weights, subset,  )
  terms: <class 'rpy2.robjects.Formula'>
  <rpy2.robjects.Formula object at 0x000002C9B2630890> [6]
R classes: ('terms', 'formula')
<rpy2.robjects.vectors.ListVector object at 0x000002C9B71F6A10> [19]
R classes: ('summary.glm',)
[LangSexpV..., LangSexpV..., ListSexpV..., FloatSexp..., ..., FloatSexp..., IntSexpVe..., FloatSexp..., FloatSexp...]
  deviance: <class 'numpy.ndarray'>
  array([33357.71065632])
...
  contrasts: <class 'numpy.ndarray'>
  array([1.])
  df.residual: <class 'rpy2.robjects.vectors.IntVector'>
  <rpy2.robjects.vectors.IntVector object at 0x000002C9B2630890> [13]
R classes: ('integer',)
[2, 9998, 2]
  null.deviance: <class 'numpy.ndarray'>
  array([[ 0.00138301, -0.00059514],
       [-0.00059514,  0.00029936]])
  df.null: <class 'numpy.ndarray'>
  array([[ 0.00138301, -0.00059514],
       [-0.00059514,  0.00029936]])
"""
python rpy2
1个回答
0
投票

您可以使用

rx
rx2
访问 R 对象中的元素。例如,要访问 GLM 模型的系数,您可以使用以下命令:

reg.rx2('coefficients')

总体来说:

  • 使用 rx 替换 R 中的单括号 ([)。
  • 使用 rx2 替换 R 中的双括号([[)或美元符号($)。

有关更多详细信息,请参阅rpy2有关线性模型的文档

或者,您可以使用 R 中的 broom::tidy 包将模型结果转换为整洁的 R 数据帧,然后将其转换为 pandas DataFrame,以便在 Python 中更轻松地访问。方法如下:

# Convert model results using broom::tidy
tidy = robjects.r('broom::tidy')
tidy_reg = tidy(reg)

# Convert the R dataframe to a pandas dataframe
glm_result = pandas2ri.rpy2py(tidy_reg)

# Accessing elements in the pandas DataFrame
print(glm_result)
print(glm_result.at['2', 'estimate'])  # Note: Index values are strings.

生成的 pandas DataFrame 将具有基于字符串的索引。如果需要,您可以重置索引并使用 Python 的标准索引方法。 以下是完整代码:

import random
import pandas as pd
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
pandas2ri.activate()

random.seed(0)
k = 10000

df = pd.DataFrame({
    'A': random.choices(range(100), k=k),
    'B': random.choices([1, 2, 3], k=k),
    'C': random.choices([0, 1], k=k),
    'D': random.choices(range(20, 30), k=k),
})

glm = robjects.r['glm']

reg = glm(
    formula='C ~ B',
    data=robjects.conversion.py2rpy(df),
    family=robjects.r['poisson'](),
    offset=df['D'])    

print(robjects.r("summary")(reg))
print(reg.rx2('coefficients'))
tidy = robjects.r('broom::tidy')
tidy_reg = tidy(reg)

glm_result = pandas2ri.rpy2py(tidy_reg)
print(glm_result)
print(glm_result.at['2','estimate'])
© www.soinside.com 2019 - 2024. All rights reserved.