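I want to count the number of unique combinations of two columns. What is the most efficient way to do this with pandas?
To me, the most intuitive approach is to group by the two columns and take the length of the resulting groupby object. However, on a large DataFrame I found this to be about 5x slower than the nunique variant (marked "Quick" below).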
# Quick
len1 = lambda: len(df[['a', 'b']].groupby(['a', 'b'], observed=True).nunique())
# Slow
len2 = lambda: len(df[['a', 'b']].groupby(['a', 'b'], observed=True))
import numpy as np
import pandas as pd
import timeit
# Set the number of rows and columns
num_rows = 10000000
num_cols = 5
# Generate random numbers
data = np.random.rand(num_rows, num_cols)
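# Overwrite columns a and b with random integers 0-9.
# Note: assigning a Categorical into a float NumPy array stores plain floats,
# so a and b end up as float64 columns in the DataFrame, not categorical dtype.
data[:, 0] = pd.Categorical(np.random.randint(0, 10, size=num_rows))
data[:, 1] = pd.Categorical(np.random.randint(0, 10, size=num_rows))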
# Create a DataFrame with the random numbers and categorical values
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'])
len1 = lambda: len(df[['a', 'b']].groupby(['a', 'b'], observed=True).nunique())
len2 = lambda: len(df[['a', 'b']].groupby(['a', 'b'], observed=True))
time1 = timeit.timeit(len1, number=10)
print(f"Execution time of len1: {time1:.5f} seconds")
time2 = timeit.timeit(len2, number=10)
print(f"Execution time of len2: {time2:.5f} seconds")
Output:
Execution time of len1: 3.16599 seconds
Execution time of len2: 17.47438 seconds
To avoid the slow groupby, I would use:
df[['a', 'b']].drop_duplicates().shape[0]
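As a quick sanity check, here is a minimal sketch showing that drop_duplicates gives the same count as the groupby variants. It uses a small made-up frame rather than the 10-million-row benchmark; the names small_df and rng are only for illustration, and no timings are claimed.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data, not the benchmark above)
rng = np.random.default_rng(0)
small_df = pd.DataFrame({
    "a": rng.integers(0, 10, size=1000),
    "b": rng.integers(0, 10, size=1000),
    "c": rng.random(1000),
})

# Count unique (a, b) pairs without grouping
n_drop = small_df[["a", "b"]].drop_duplicates().shape[0]

# Same count via groupby: length of the groupby object and its ngroups attribute
grouped = small_df.groupby(["a", "b"])
n_groupby = len(grouped)
n_ngroups = grouped.ngroups

assert n_drop == n_groupby == n_ngroups
print(n_drop)
```

One caveat: the counts can differ if a or b contains NaN, because groupby drops NaN keys by default (dropna=True) while drop_duplicates keeps NaN as a value.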