For each entry in a pandas.Series, I want to get the number of elements that are less than or equal to it, for example:
import pandas as pd

if __name__ == '__main__':
    a = pd.Series(data=[4, 7, 3, 5, 2, 1, 1, 6])
    # for each entry i, count the elements of a that are <= i
    le = pd.Series(data=[a[a <= i].count() for i in a])
    print(le)
Result:
0 5
1 8
2 4
3 6
4 3
5 2
6 2
7 7
dtype: int64
Is there a Series function, or a better way, to do this for large datasets?
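A faster option is a numpy solution: convert the Series to a numpy array, compare it against itself as a 2D array via broadcasting, and finally count the True values in each row with sum: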
b = a.values
#pandas 0.24+
#b = a.to_numpy()
le = pd.Series((b <= b[:, None]).sum(axis=1), index=a.index)
Details:
print (b <= b[:, None])
[[ True False True False True True True False]
[ True True True True True True True True]
[False False True False True True True False]
[ True False True True True True True False]
[False False False False True True True False]
[False False False False False True True False]
[False False False False False True True False]
[ True False True True True True True True]]
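To see where the 2D shape comes from, the intermediate arrays can be inspected on a smaller input. This is only a minimal sketch: the three-element array `x` is a made-up example, not data from the question.

```python
import numpy as np

x = np.array([4, 7, 3])      # hypothetical small input
col = x[:, None]             # shape (3, 1): each value sits in its own row
mask = x <= col              # broadcast to (3, 3); mask[i, j] is x[j] <= x[i]
counts = mask.sum(axis=1)    # number of True values in each row

print(col.shape, mask.shape)
print(counts)
```

Row `i` of `mask` marks every element that is less than or equal to `x[i]`, so summing along `axis=1` gives the count for each entry.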
# alternative solutions using Series.le
le = pd.Series([a.le(i).sum() for i in a])
le = a.apply(lambda i: a.le(i).sum())
print(le)
0 5
1 8
2 4
3 6
4 3
5 2
6 2
7 7
dtype: int64
Performance:
import numpy as np
np.random.seed(2019)
N = 10**6
s = pd.Series(np.random.randint(100, size=N))
#print (s)
In [173]: %%timeit
...: b = a.values
...: le = pd.Series((b <= b[:, None]).sum(axis=1), index=a.index)
...:
78.6 µs ± 510 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [174]: %%timeit
...: le = pd.Series([a.le(i).sum() for i in a])
...:
3.22 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [175]: %%timeit
...: le = a.apply(lambda i: a.le(i).sum())
...:
3.35 ms ± 290 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [176]: %%timeit
...: a.apply(lambda x: a[a.le(x)].count())
...:
...:
5.41 ms ± 457 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [177]: %%timeit
...: le = pd.Series(data=[a[a <= i].count() for i in a])
...:
4.91 ms ± 281 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
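One caveat for large inputs: the broadcasting comparison materialises an N x N boolean array, so for N = 10**6 it would need about 10**12 one-byte booleans, roughly 1 TB. A sort-based variant avoids that. The sketch below swaps in `np.sort` plus `np.searchsorted`, which is not one of the approaches timed above, and it assumes the Series contains no NaN values.

```python
import numpy as np
import pandas as pd

a = pd.Series(data=[4, 7, 3, 5, 2, 1, 1, 6])

b = a.to_numpy()             # pandas 0.24+, otherwise a.values
sorted_b = np.sort(b)
# side='right' gives the insertion point after any equal values,
# which is the number of elements <= each value
counts = np.searchsorted(sorted_b, b, side='right')
le = pd.Series(counts, index=a.index)
print(le)
```

This needs only O(N) extra memory and O(N log N) time, so it should remain usable where the 2D array no longer fits in memory.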
You can use apply with a lambda function:
In [4]: a.apply(lambda x: a[a.le(x)].count())
Out[4]: 0 5
1 8
2 4
3 6
4 3
5 2
6 2
7 7
dtype: int64
Since the question is about applying this to large datasets:
%timeit [(a.values <= x).sum() for x in a]
10000 loops, best of 3: 28.6 µs per loop
%timeit le = pd.Series(data=[a[a <= i].count() for i in a])
100 loops, best of 3: 2.74 ms per loop
%timeit a.apply(lambda x: a[a.le(x)].count())
100 loops, best of 3: 3.09 ms per loop
This shows that apply is slow, and that the OP's approach is not the best either.
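As for a built-in Series function, one candidate not timed in either answer is `Series.rank` with `method='max'`. The following is a sketch under the assumption that the Series has no NaN values, since NaN entries are left unranked by default.

```python
import pandas as pd

a = pd.Series(data=[4, 7, 3, 5, 2, 1, 1, 6])

# method='max' gives tied values the highest rank in their group,
# which equals the number of elements <= that value
le = a.rank(method='max').astype(int)
print(le)
```

Ranking is based on sorting rather than on pairwise comparison, so it avoids the per-element Python loop that apply and the list comprehension use.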