Length of groupby result in pandas

Question

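I want to count the number of unique combinations of two columns. What is the most efficient way to do this with pandas?

To me, the most intuitive approach is to group by the two columns and take the length of the resulting groupby object (len2 below). For large DataFrames, however, I find this is roughly 5 times slower than taking the length of the .nunique() result (len1 below).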

# Quick
len1 = lambda: len(df[['a', 'b']].groupby(['a', 'b'], observed=True).nunique())
# Slow
len2 = lambda: len(df[['a', 'b']].groupby(['a', 'b'], observed=True))
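As a sanity check that the two variants above count the same thing, here is a minimal sketch on a tiny DataFrame. The frame `toy` and the variable names are hypothetical and not part of the original question; the data is chosen so that there are exactly three distinct (a, b) pairs.

```python
import pandas as pd

# Hypothetical toy data: 5 rows, 3 distinct (a, b) pairs:
# (1, 0), (2, 0) and (2, 1).
toy = pd.DataFrame({"a": [1, 1, 2, 2, 1], "b": [0, 0, 0, 1, 0]})

grouped = toy[["a", "b"]].groupby(["a", "b"], observed=True)

# Variant 1: aggregate first, then count the rows of the result
# (one row per group).
n_via_nunique = len(grouped.nunique())

# Variant 2: take the length of the groupby object itself
# (the number of groups).
n_via_len = len(grouped)

print(n_via_nunique, n_via_len)
```

Both values should agree; only the time needed to obtain them differs on large inputs.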

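Questions

Why is the nunique variant (len1) faster than taking the length of the groupby object directly (len2)?
What is the fastest way to do this computation?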

MWE to reproduce

import numpy as np
import pandas as pd
import timeit

# Set the number of rows and columns
num_rows = 10000000
num_cols = 5

# Generate random numbers
data = np.random.rand(num_rows, num_cols)

# Overwrite columns a and b with random integers in [0, 10).
# Note: the values are written into a float64 NumPy array, so the
# Categorical is cast back to floats and the dtype is not categorical.
data[:, 0] = pd.Categorical(np.random.randint(0, 10, size=num_rows))
data[:, 1] = pd.Categorical(np.random.randint(0, 10, size=num_rows))

# Create a DataFrame from the array (all five columns are float64)
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'])

len1 = lambda: len(df[['a', 'b']].groupby(['a', 'b'], observed=True).nunique())
len2 = lambda: len(df[['a', 'b']].groupby(['a', 'b'], observed=True))
           
time1 = timeit.timeit(len1, number=10)
print(f"Execution time of len1: {time1:.5f} seconds")

time2 = timeit.timeit(len2, number=10)
print(f"Execution time of len2: {time2:.5f} seconds")

Output:

Execution time of len1: 3.16599 seconds
Execution time of len2: 17.47438 seconds

Answer 1

To avoid the slow groupby, I would use:

df[['a', 'b']].drop_duplicates().shape[0]
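A minimal sketch of this approach on a tiny DataFrame, cross-checked against the number of groups reported by groupby. The frame `toy` and the variable names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: 5 rows, 3 distinct (a, b) pairs.
toy = pd.DataFrame({
    "a": [1, 1, 2, 2, 1],
    "b": [0, 0, 0, 1, 0],
    "c": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Keep only the key columns, drop repeated rows, count what is left.
n_unique = toy[["a", "b"]].drop_duplicates().shape[0]

# Cross-check: number of groups formed by the same two keys.
n_groups = toy.groupby(["a", "b"]).ngroups

print(n_unique, n_groups)
```

One caveat when comparing the two: drop_duplicates keeps rows whose keys contain NaN, while groupby drops NaN keys by default (dropna=True), so the counts can differ on data with missing keys.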

Answer 2

Use DataFrame.duplicated together with sum and an inverted mask:

(~df[['a', 'b']].duplicated()).sum()
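The same expression, broken into steps on a tiny DataFrame to show what the mask contains. The frame `toy` and the variable names are hypothetical; this is an illustrative sketch rather than a benchmark.

```python
import pandas as pd

# Hypothetical toy data: 5 rows, 3 distinct (a, b) pairs.
toy = pd.DataFrame({"a": [1, 1, 2, 2, 1], "b": [0, 0, 0, 1, 0]})

# True for every row whose (a, b) pair already appeared in an earlier row
# (the default is keep="first").
is_repeat = toy[["a", "b"]].duplicated()

# Inverting the mask marks the first occurrence of each pair.
first_seen = ~is_repeat

# Summing a boolean Series counts the True values.
n_unique = int(first_seen.sum())

print(is_repeat.tolist())  # [False, True, False, False, True]
print(n_unique)            # 3
```

Because only a boolean mask is built and summed, no deduplicated copy of the key columns is materialized.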
