I have a DataFrame with 10 million rows that looks like this:
rdata={'first': {0: 'john', 1: 'david', 2: 'daniel', 3: 'john'}, 'last': {0: 'smith', 1: 'jones', 2: 'bond', 3: 'smith'}, 'ph': {0: 123456, 1: 456789, 2: 12345, 3: 234567}, 'address': {0: 'AB12 3CB', 1: 'EF45 6GH', 2: 'IJ78 9KL', 3: 'AB32 3CD'}, 'email': {0: '[email protected]', 1: '[email protected]', 2: '[email protected]', 3: '[email protected]'}, 'id': {0: 12345678, 1: 90123456, 2: 78901234, 3: 98765432}}
df = pd.DataFrame(rdata)
I want to search by first and last name and extract the matching records.
I am currently using the following approaches:
d = df.loc[(df['last'] == 'smith') & (df['first'] == 'john')]
d = df[(df['last'] == 'smith') & (df['first'] == 'john')]
d = df.iloc[np.where((df['last'] == 'smith') & (df['first'] == 'john'))]
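All of these take about the same time to run (1.73 s), but I would like to know whether there is a more efficient way to search for records.
I also tried converting to a NumPy array:
df.to_numpy()[(df['last']=='smith') & (df['first']=='john')]
but saw no noticeable gain.
Are there other options I might be missing?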
query can be a bit faster on large datasets:
df.query('(last == "smith") & (first == "john")')
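As a quick sanity check, here is a minimal sketch showing that query returns the same rows as the boolean-mask version, and that search values can be passed in from local variables with the @ prefix. The trimmed-down frame and the names first_name and last_name are only illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Small stand-in for the real frame (only a few of the columns).
df = pd.DataFrame({
    "first": ["john", "david", "daniel", "john"],
    "last": ["smith", "jones", "bond", "smith"],
    "id": [12345678, 90123456, 78901234, 98765432],
})

# Hypothetical search values; in practice these would come from user input.
first_name = "john"
last_name = "smith"

# Boolean-mask selection, as in the question.
by_mask = df.loc[(df["last"] == last_name) & (df["first"] == first_name)]

# Equivalent query; @name refers to a local Python variable.
by_query = df.query("last == @last_name and first == @first_name")

# Both approaches should select exactly the same rows.
same = by_mask.equals(by_query)
print(same)
```

As far as I know, query evaluates the expression with numexpr when that package is installed, which is where any speed-up on large frames would come from, so results may differ from one environment to another.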
40 rows:
# query
1.45 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# loc
239 µs ± 20.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
40k rows:
# query
3.74 ms ± 318 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# loc
5.06 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This probably depends on the exact dataset, so it is best to test it on your own data.
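If you want to run that comparison yourself, a rough sketch along these lines might help. It builds a synthetic frame (the row count, the name pools and the helper names with_loc and with_query are all assumptions made for illustration) and times both approaches with timeit; swap in your real DataFrame to get meaningful numbers.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data; replace with your own DataFrame for a real comparison.
rng = np.random.default_rng(0)
n_rows = 100_000  # assumption: adjust to match your data size
df = pd.DataFrame({
    "first": rng.choice(np.array(["john", "david", "daniel"]), n_rows),
    "last": rng.choice(np.array(["smith", "jones", "bond"]), n_rows),
})


def with_loc(data):
    # Boolean mask combined with .loc
    return data.loc[(data["last"] == "smith") & (data["first"] == "john")]


def with_query(data):
    # Same filter expressed as a query string
    return data.query('last == "smith" and first == "john"')


calls = 10
timings = {}
for fn in (with_loc, with_query):
    # Best of 3 repeats, converted to seconds per call.
    best = min(timeit.repeat(lambda: fn(df), number=calls, repeat=3))
    timings[fn.__name__] = best / calls

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

Taking the minimum over a few repeats reduces noise from other processes; the absolute numbers will still vary by machine, pandas version and column dtypes.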
That is an interesting observation: pd.DataFrame.query can indeed be faster as the number of rows grows:
import pandas as pd
import numpy as np
from timeit import timeit
rdata={'first': {0: 'john', 1: 'david', 2: 'daniel', 3: 'john'}, 'last': {0: 'smith', 1: 'jones', 2: 'bond', 3: 'smith'}, 'ph': {0: 123456, 1: 456789, 2: 12345, 3: 234567}, 'address': {0: 'AB12 3CB', 1: 'EF45 6GH', 2: 'IJ78 9KL', 3: 'AB32 3CD'}, 'email': {0: '[email protected]', 1: '[email protected]', 2: '[email protected]', 3: '[email protected]'}, 'id': {0: 12345678, 1: 90123456, 2: 78901234, 3: 98765432}}
df = pd.DataFrame(rdata)
def func_loc(data):
    return data.loc[(data['last']=='smith') & (data['first']=='john')]

def func_bool(data):
    return data[(data['last']=='smith') & (data['first']=='john')]

def func_iloc(data):
    return data.iloc[np.where((data['last']=='smith') & (data['first']=='john'))]

def func_query(data):
    return data.query('last == "smith" and first == "john"')
res = pd.DataFrame(
    index=[1, 10, 30, 100, 300, 1_000, 3_000, 10_000, 30_000, 100_000],
    columns='func_loc func_bool func_iloc func_query'.split(),
    dtype=float
)
for i in res.index:
    d = pd.concat([df]*i)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)
res.plot(loglog=True);
Output: [plot image not reproduced here]