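I need to create bins based on one column of a data frame. The problem is that the column's values are oddly distributed, so the pd.qcut-based approach shown below can arbitrarily place observations into different bins even when they have the same value.
In R, I use the xtile function from the statar package (Stata has an equivalent xtile command). R groups all observations with the same value into a single bin.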
library(tidyverse)
sample_df <- data.frame(customer_id = seq(1:10),
purch_frequency = c(1, 1, 1, 1, 1, 2, 3, 10, 11, 11))
sample_df <- sample_df %>%
mutate(freq_bins1=statar::xtile(purch_frequency, 2),
freq_bins2=statar::xtile(purch_frequency, 3))
print(sample_df)
The corresponding implementation in Python:
import pandas as pd
data = {'customer_id': range(1,11),
'purch_frequency': [1, 1, 1, 1, 1, 2, 3, 10, 11, 11]}
sample_df = pd.DataFrame(data)
sample_df['freq_bins1'] = \
(sample_df['purch_frequency'].rank(method = 'first')
.transform(lambda x: pd.qcut(x, 2, labels = False)))
sample_df['freq_bins2'] = \
(sample_df['purch_frequency'].rank(method = 'first')
.transform(lambda x: pd.qcut(x, 3, labels = False)))
print(sample_df)
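As you can see, R and Python give different answers for the last column, freq_bins2. I would like to know how to modify the Python code so that it matches R's result. Thanks!
A quick follow-up: the R and Python outputs are attached below. For R (Python's bin labels are smaller than R's by 1, which is fine):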
customer_id purch_frequency freq_bins1 freq_bins2
1 1 1 1 1
2 2 1 1 1
3 3 1 1 1
4 4 1 1 1
5 5 1 1 1
6 6 2 2 2
7 7 3 2 2
8 8 10 2 3
9 9 11 2 3
10 10 11 2 3
For Python:
customer_id purch_frequency freq_bins1 freq_bins2
0 1 1 0 0
1 2 1 0 0
2 3 1 0 0
3 4 1 0 0
4 5 1 0 1
5 6 2 1 1
6 7 3 1 1
7 8 10 1 2
8 9 11 1 2
9 10 11 1 2
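The difference seems to come from the ranking step rather than from the binning itself: rank(method='first') gives tied values distinct ranks in order of appearance, so qcut then cuts those ranks into roughly equal-sized groups and a boundary can fall inside a run of identical values. A minimal sketch that reuses the sample data from the question (the variable names `freq`, `ranks`, and `bins_from_ranks` are only for illustration):

```python
import pandas as pd

freq = pd.Series([1, 1, 1, 1, 1, 2, 3, 10, 11, 11], name="purch_frequency")

# method='first' breaks ties by order of appearance, so the five 1s
# receive five different ranks instead of a shared one.
ranks = freq.rank(method="first")
print(ranks.tolist())

# qcut on distinct ranks yields roughly equal-sized bins, which is why
# the run of 1s ends up spread over more than one bin.
bins_from_ranks = pd.qcut(ranks, 3, labels=False)
print(pd.DataFrame({"purch_frequency": freq,
                    "rank": ranks,
                    "bin": bins_from_ranks}))
```

In the Python output posted above, this is the split between customers 1 to 4 (bin 0) and customer 5 (bin 1), although all five have purch_frequency equal to 1.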
There may be a better answer, but I found a workaround by calling the R function (statar::xtile) from Python.
# You need to first install rpy2
# Activate rpy2 to use R functions/packages in Python
import rpy2
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
# in particular, the following two lines activate R functions
# for pandas' dataframes
from rpy2.robjects import pandas2ri
pandas2ri.activate()
### This chunk is the same as the original post
import pandas as pd
data = {'customer_id': range(1,11),
'purch_frequency': [1, 1, 1, 1, 1, 2, 3, 10, 11, 11]}
sample_df = pd.DataFrame(data)
sample_df['freq_bins1'] = \
(sample_df['purch_frequency'].rank(method = 'first')
.transform(lambda x: pd.qcut(x, 2, labels = False)))
sample_df['freq_bins2'] = \
(sample_df['purch_frequency'].rank(method = 'first')
.transform(lambda x: pd.qcut(x, 3, labels = False)))
### The following is to call R's statar::xtile
statar = importr('statar')
sample_df['freq_bin3'] = statar.xtile(sample_df['purch_frequency'], 3)
print(sample_df)
The output looks like this:
customer_id purch_frequency freq_bins1 freq_bins2 freq_bin3
0 1 1 0 0 1
1 2 1 0 0 1
2 3 1 0 0 1
3 4 1 0 0 1
4 5 1 0 1 1
5 6 2 1 1 2
6 7 3 1 1 2
7 8 10 1 2 3
8 9 11 1 2 3
9 10 11 1 2 3
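import numpy as np

def xtile(data, num_tiles):
    # Cut points at evenly spaced percentiles (num_tiles + 1 values, 0 to 100)
    quantiles = np.percentile(data, np.linspace(0, 100, num_tiles + 1))
    # Use only the interior cut points; right=True makes each bin right-closed,
    # so equal values always share a bin. The +1 gives 1-based labels as in R.
    tile_index = np.digitize(data, quantiles[1:-1], right=True) + 1
    return tile_index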
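As a cross-check that avoids floating-point cut points altogether, the same right-closed binning can be sketched with integer arithmetic on ranks. This assumes that xtile places a value in tile 1 + floor(k * m / n), where m is the number of observations strictly smaller than that value, n is the number of observations, and k is the number of tiles. That reading reproduces the R output posted in this thread, but it has not been checked against statar's source. The name `xtile_by_rank` is hypothetical, and missing values are not handled.

```python
import pandas as pd

def xtile_by_rank(values, num_tiles):
    """Hypothetical helper: 1-based tile labels, ties always share a tile."""
    s = pd.Series(values)
    n = len(s)
    # rank(method='min') gives tied values the same (lowest) rank, so
    # rank - 1 is the number of observations strictly below each value.
    below = s.rank(method="min").astype(int) - 1
    # Integer division avoids rounding issues at the tile boundaries.
    return below * num_tiles // n + 1

freq = [1, 1, 1, 1, 1, 2, 3, 10, 11, 11]
print(xtile_by_rank(freq, 2).tolist())  # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
print(xtile_by_rank(freq, 3).tolist())  # [1, 1, 1, 1, 1, 2, 2, 3, 3, 3]
```

With the data frame from the question, this could be used as sample_df['freq_bins2'] = xtile_by_rank(sample_df['purch_frequency'], 3); subtract 1 if 0-based labels like pd.qcut's are preferred.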