Fastest way to search a pandas DataFrame containing text and duplicate values [duplicate]

Question

I have a DataFrame with 10 million rows that looks like this:

rdata={'first': {0: 'john', 1: 'david', 2: 'daniel', 3: 'john'}, 'last': {0: 'smith', 1: 'jones', 2: 'bond', 3: 'smith'}, 'ph': {0: 123456, 1: 456789, 2: 12345, 3: 234567}, 'address': {0: 'AB12 3CB', 1: 'EF45 6GH', 2: 'IJ78 9KL', 3: 'AB32 3CD'}, 'email': {0: '[email protected]', 1: '[email protected]', 2: '[email protected]', 3: '[email protected]'}, 'id': {0: 12345678, 1: 90123456, 2: 78901234, 3: 98765432}}
df = pd.DataFrame(rdata)

I want to search by first and last name and extract the matching records.

I am currently using the following approaches:

d = df.loc[(df['last'] == 'smith') & (df['first'] == 'john')]
d = df[(df['last'] == 'smith') & (df['first'] == 'john')]
d = df.iloc[np.where((df['last'] == 'smith') & (df['first'] == 'john'))]

All of these take about the same time to run (1.73 s), but I would like to know whether there is a more efficient way to look up records.

I also tried converting to a NumPy array first:

df.to_numpy()[(df['last']=='smith') & (df['first']=='john')]

but saw no noticeable gain.

Are there other options I might be missing?

Answer 1

DataFrame.query can be a bit faster on large datasets:

df.query('(last == "smith") & (first == "john")')
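If the names to search for are held in variables rather than hard-coded, query can refer to local Python variables by prefixing them with @. The following is only a minimal sketch: the helper name find_person and the small sample frame are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Small illustrative frame; the values mirror the question's sample data.
df = pd.DataFrame({
    'first': ['john', 'david', 'daniel', 'john'],
    'last': ['smith', 'jones', 'bond', 'smith'],
    'ph': [123456, 456789, 12345, 234567],
})

def find_person(data, first, last):
    # Hypothetical helper: inside the query string, '@name' refers to a
    # local Python variable, while a bare name refers to a column.
    return data.query('last == @last and first == @first')

matches = find_person(df, 'john', 'smith')
print(matches)
```

This keeps the query string fixed and avoids building it with string formatting, which would otherwise need careful quoting of the search terms.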

With 40 rows:

# query
1.45 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# loc
239 µs ± 20.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

With 40k rows:

# query
3.74 ms ± 318 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# loc
5.06 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This probably depends on the exact dataset, so it is best to benchmark with your own data.
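One way to run such a comparison outside IPython is with the standard-library timeit module. This is a rough sketch rather than the exact benchmark behind the numbers above: the function names with_loc and with_query, the 40,000-row test frame and the repeat count of 20 are all assumptions chosen for illustration.

```python
import timeit

import pandas as pd

# Illustrative data: the question's four sample names repeated to 40,000 rows.
base = pd.DataFrame({
    'first': ['john', 'david', 'daniel', 'john'],
    'last': ['smith', 'jones', 'bond', 'smith'],
})
df = pd.concat([base] * 10_000, ignore_index=True)

def with_loc(data):
    # Boolean-mask selection through .loc
    return data.loc[(data['last'] == 'smith') & (data['first'] == 'john')]

def with_query(data):
    # The same filter expressed as a query string
    return data.query('(last == "smith") & (first == "john")')

# Both approaches should select exactly the same rows.
assert with_loc(df).equals(with_query(df))

# Average time per call; results vary with machine, pandas version and data.
timings = {}
for func in (with_loc, with_query):
    timings[func.__name__] = timeit.timeit(lambda: func(df), number=20) / 20
    print(f'{func.__name__}: {timings[func.__name__] * 1e3:.2f} ms per call')
```

Swapping in your real DataFrame and search terms for df and the literals gives numbers that reflect your own data rather than this synthetic frame.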

Answer 2

An interesting observation: pd.DataFrame.query can indeed become faster as the number of rows grows:

import pandas as pd
import numpy as np
from timeit import timeit

rdata={'first': {0: 'john', 1: 'david', 2: 'daniel', 3: 'john'}, 'last': {0: 'smith', 1: 'jones', 2: 'bond', 3: 'smith'}, 'ph': {0: 123456, 1: 456789, 2: 12345, 3: 234567}, 'address': {0: 'AB12 3CB', 1: 'EF45 6GH', 2: 'IJ78 9KL', 3: 'AB32 3CD'}, 'email': {0: '[email protected]', 1: '[email protected]', 2: '[email protected]', 3: '[email protected]'}, 'id': {0: 12345678, 1: 90123456, 2: 78901234, 3: 98765432}}
df = pd.DataFrame(rdata)

def func_loc(data):
    return data.loc[(data['last']=='smith') & (data['first']=='john')]

def func_bool(data):
    return data[(data['last']=='smith') & (data['first']=='john')]

def func_iloc(data):
    return data.iloc[np.where((data['last']=='smith') & (data['first']=='john'))]

def func_query(data):
    return data.query('last == "smith" and first == "john"')

res = pd.DataFrame(
    index=[1, 10, 30, 100, 300, 1_000, 3_000, 10_000, 30_000, 100_000],
    columns='func_loc func_bool func_iloc func_query'.split(),
    dtype=float
)

for i in res.index:
    d = pd.concat([df]*i)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)

res.plot(loglog=True);

Output: (plot image not included)
