How can Python's pd.qcut give the same result as R's statar::xtile?


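I need to create bins from one column of a data frame. One problem is that the column's values are oddly distributed, so Python's pd.qcut may arbitrarily put observations into different bins even when they have the same value.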

In R (or Stata), I use the xtile function, which in R comes from the statar package. It puts all observations with the same value into a single bin.

library(tidyverse)

sample_df <- data.frame(customer_id  = seq(1:10),
                        purch_frequency = c(1, 1, 1, 1, 1, 2, 3, 10, 11, 11))

sample_df <- sample_df %>% 
  mutate(freq_bins1=statar::xtile(purch_frequency, 2),
         freq_bins2=statar::xtile(purch_frequency, 3))

print(sample_df)

The corresponding implementation in Python:

import pandas as pd

data = {'customer_id': range(1,11),
        'purch_frequency': [1, 1, 1, 1, 1, 2, 3, 10, 11, 11]}
sample_df = pd.DataFrame(data)

sample_df['freq_bins1'] = \
    (sample_df['purch_frequency'].rank(method = 'first')
     .transform(lambda x: pd.qcut(x, 2, labels = False)))
sample_df['freq_bins2'] = \
    (sample_df['purch_frequency'].rank(method = 'first')
     .transform(lambda x: pd.qcut(x, 3, labels = False)))
print(sample_df)

As you can see, R and Python give different answers for the last column, freq_bins2. How can I modify the Python code so that it matches R's result? Thanks!


Quick follow-up: the R and Python outputs are attached below. For R (Python's bin indices are 1 less than R's, which is fine):

   customer_id purch_frequency freq_bins1 freq_bins2
1            1               1          1          1
2            2               1          1          1
3            3               1          1          1
4            4               1          1          1
5            5               1          1          1
6            6               2          2          2
7            7               3          2          2
8            8              10          2          3
9            9              11          2          3
10          10              11          2          3

For Python:

   customer_id  purch_frequency  freq_bins1  freq_bins2
0            1                1           0           0
1            2                1           0           0
2            3                1           0           0
3            4                1           0           0
4            5                1           0           1
5            6                2           1           1
6            7                3           1           1
7            8               10           1           2
8            9               11           1           2
9           10               11           1           2
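
Why the two outputs differ can be sketched in a few lines. This is an illustration only, not part of the original post, and the variable names are made up for the example. It shows that pd.qcut on the raw values refuses to build three bins because two edges coincide, and that ranking with method="first" gets around this by giving tied values distinct ranks, which is exactly what lets the 1s end up in different bins.

```python
import pandas as pd

purch = pd.Series([1, 1, 1, 1, 1, 2, 3, 10, 11, 11])

# 1) qcut on the raw values: the lowest third consists entirely of 1s, so two
#    bin edges coincide and pandas raises instead of guessing.
try:
    pd.qcut(purch, 3, labels=False)
    raw_qcut_failed = False
except ValueError:
    raw_qcut_failed = True

# 2) rank(method="first") numbers tied values by order of appearance,
#    so the five 1s receive five distinct ranks.
ranks = purch.rank(method="first")

# 3) qcut on those ranks cuts by position, not by value, which splits the 1s.
bins_from_ranks = pd.qcut(ranks, 3, labels=False)
bins_used_by_ones = sorted(bins_from_ranks[purch == 1].unique())

print(raw_qcut_failed)
print(ranks.tolist())
print(bins_used_by_ones)
```

Passing duplicates="drop" to pd.qcut avoids the error, but it merges the coinciding edges and therefore returns fewer bins than requested, so it does not reproduce the R result either.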
python pandas
2 Answers
0 votes

There may be a better answer, but I found a workaround: calling the R function (statar::xtile) from Python.

# You need to first install rpy2
# Activate rpy2 to use R functions/packages in Python 
import rpy2
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
# in particular, the following two lines activate R functions
# for pandas' dataframes
from rpy2.robjects import pandas2ri 
pandas2ri.activate()

### This chunk is the same as the original post
import pandas as pd
data = {'customer_id': range(1,11),
        'purch_frequency': [1, 1, 1, 1, 1, 2, 3, 10, 11, 11]}
sample_df = pd.DataFrame(data)

sample_df['freq_bins1'] = \
    (sample_df['purch_frequency'].rank(method = 'first')
     .transform(lambda x: pd.qcut(x, 2, labels = False)))
sample_df['freq_bins2'] = \
    (sample_df['purch_frequency'].rank(method = 'first')
     .transform(lambda x: pd.qcut(x, 3, labels = False)))

### The following is to call R's statar::xtile
statar = importr('statar')
sample_df['freq_bin3'] = statar.xtile(sample_df['purch_frequency'], 3)

print(sample_df)

The output looks like this:

   customer_id  purch_frequency  freq_bins1  freq_bins2  freq_bin3
0            1                1           0           0          1
1            2                1           0           0          1
2            3                1           0           0          1
3            4                1           0           0          1
4            5                1           0           1          1
5            6                2           1           1          2
6            7                3           1           1          2
7            8               10           1           2          3
8            9               11           1           2          3
9           10               11           1           2          3

0 votes
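import numpy as np

def xtile(data, num_tiles):
    # Calculate the quantile edges (0%, ..., 100%)
    quantiles = np.percentile(data, np.linspace(0, 100, num_tiles + 1))

    # Use numpy.digitize with the interior edges to assign labels 1..num_tiles;
    # right=True puts a value equal to an edge into the lower bin
    tile_index = np.digitize(data, quantiles[1:-1], right=True) + 1

    return tile_index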
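
The NumPy answer above relies on np.percentile, whose default linear interpolation is not necessarily the rule statar::xtile applies, and a floating-point percentile such as 100/3 can land a hair off an observation. As a hedged alternative, here is a minimal sketch that computes the cutpoints with integer arithmetic. It assumes Stata-style cutpoints (take the next observation up, or average two neighbours when the position n*k/num_tiles is a whole number); that convention is an assumption, not something this thread confirms. The name xtile_sketch is hypothetical, and missing values are not handled.

```python
import numpy as np

def xtile_sketch(values, num_tiles):
    """Assign each value to a bin 1..num_tiles; equal values share a bin."""
    x = np.asarray(values, dtype=float)
    ordered = np.sort(x)
    n = len(ordered)
    cutpoints = []
    for k in range(1, num_tiles):
        # Position n*k/num_tiles, kept exact with integer division.
        whole, remainder = divmod(n * k, num_tiles)
        if remainder == 0:
            # Position falls exactly between two observations: average them.
            cutpoints.append((ordered[whole - 1] + ordered[whole]) / 2)
        else:
            # Otherwise take the next observation up (assumed convention).
            cutpoints.append(ordered[whole])
    # side="left" sends a value equal to a cutpoint to the lower bin,
    # so tied values never straddle a boundary.
    return np.searchsorted(cutpoints, x, side="left") + 1

purch_frequency = [1, 1, 1, 1, 1, 2, 3, 10, 11, 11]
bins2 = xtile_sketch(purch_frequency, 2)
bins3 = xtile_sketch(purch_frequency, 3)
print(bins2.tolist())
print(bins3.tolist())
```

On the sample data this reproduces the freq_bins1 and freq_bins2 columns of the R output shown in the question (1-based bins); subtract 1 to get the 0-based labels that pd.qcut returns with labels=False.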