Removing a large number of IDs from a large DataFrame takes a long time


I have two DataFrames:

df1
df2

print(df1.shape)
(1042009, 40)

print(df1.columns)
Index(['date_acte', 'transaction_id', 'amount', ...],
      dtype='object')

print(df2.shape)
(734738, 37)

print(df2.columns)
Index(['date', 'transaction_id', 'amount', ...],
      dtype='object')

I want to remove from df1 every row whose transaction_id appears in df2, and keep the rest.

This is what I did:

Filtre = list(df2.transaction_id.unique())
print(len(Filtre))
733465

noMatched = df1.loc[
    (~df1['transaction_id'].str.contains('|'.join(Filtre), case=False, na=False))]

My problem is that computing noMatched takes nearly 5 hours. Is there a more efficient way to write this code, so that the output is ready in well under 5 hours?

python pandas dataframe contains
1 Answer

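You can use isin instead of str.contains. It looks each ID up in a hash-based structure rather than scanning every row against a regex with 733465 alternatives. Note that isin matches whole values, whereas str.contains matches substrings, so the two are only equivalent when no ID is contained in another. A benchmark on synthetic integer IDs of the same size: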

import time

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'transaction_id': np.random.randint(1000000, 2000000, size=1042009),
    'amount': np.random.rand(1042009),
    'date_acte': pd.date_range('2020-01-01', periods=1042009, freq='min')  # 'T' is deprecated in recent pandas
})

df2 = pd.DataFrame({
    'transaction_id': np.random.randint(1500000, 2500000, size=734738),
    'amount': np.random.rand(734738),
    'date': pd.date_range('2020-01-01', periods=734738, freq='min')  # 'T' is deprecated in recent pandas
})

start_time = time.time()

filtre_set = set(df2['transaction_id'].unique())

noMatched = df1[~df1['transaction_id'].isin(filtre_set)]

end_time = time.time()

print(f"Filtered DataFrame:\n{noMatched}")
print(f"Execution time: {end_time - start_time:.2f} seconds")

Output:

Filtered DataFrame:
         transaction_id    amount           date_acte
1               1231651  0.849124 2020-01-01 00:01:00
2               1443550  0.031414 2020-01-01 00:02:00
3               1164444  0.973699 2020-01-01 00:03:00
4               1371353  0.554666 2020-01-01 00:04:00
7               1072327  0.867207 2020-01-01 00:07:00
...                 ...       ...                 ...
1042004         1499512  0.114861 2021-12-24 14:44:00
1042005         1255963  0.756608 2021-12-24 14:45:00
1042006         1203341  0.091380 2021-12-24 14:46:00
1042007         1016687  0.153179 2021-12-24 14:47:00
1042008         1036581  0.382781 2021-12-24 14:48:00

[770625 rows x 3 columns]
Execution time: 0.52 seconds
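
The benchmark above uses integer IDs, while the question filters string IDs case-insensitively (case=False). If the real IDs are strings, a minimal sketch of the same idea could normalise case on both sides before calling isin. The toy values below are made up for illustration, and no_matched and ids_to_drop are just illustrative names:

```python
import pandas as pd

# Toy data: the column name matches the question, the values are invented.
df1 = pd.DataFrame({"transaction_id": ["A1", "a2", "B3", "A10", None]})
df2 = pd.DataFrame({"transaction_id": ["a1", "A2", "A2", "C9"]})

# Lower-case both sides once so the comparison ignores case,
# then drop duplicates and missing values from the filter list.
ids_to_drop = df2["transaction_id"].dropna().str.lower().unique()

# Exact (whole-value) membership test; missing IDs in df1 never match.
mask = df1["transaction_id"].str.lower().isin(ids_to_drop)

# Keep only the rows of df1 whose ID does not appear in df2.
no_matched = df1[~mask]
print(no_matched)
```

Here the row with ID A10 is kept, because isin compares whole values. The original str.contains filter would have dropped it, since a1 is a substring of a10. If substring matching is what you actually need, isin is not a drop-in replacement.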