Efficiently remove rows from a pandas df based on the second-latest time in a column

Question

I have a pandas DataFrame similar to this:

index	id	time_1	time_2
0	101	2024-06-20 14:32:22	2024-06-20 14:10:31
1	101	2024-06-20 15:21:31	2024-06-20 14:32:22
2	101	2024-06-20 15:21:31	2024-06-20 15:21:31
3	102	2024-06-20 16:26:51	2024-06-20 15:21:31
4	102	2024-06-20 16:26:51	2024-06-20 16:56:24
5	103	2024-06-20 20:05:44	2024-06-20 21:17:35
6	103	2024-06-20 22:41:22	2024-06-20 22:21:31
7	103	2024-06-20 23:11:56	2024-06-20 23:01:31

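For each ID in my df, I want to get the second-latest time_1 (if one exists). I then want to compare that time against the timestamps in time_2 and drop every row where time_2 is earlier than it. My expected output is:

index	id	time_1	time_2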
1	101	2024-06-20 15:21:31	2024-06-20 14:32:22
2	101	2024-06-20 15:21:31	2024-06-20 15:21:31
3	102	2024-06-20 16:26:51	2024-06-20 15:21:31
4	102	2024-06-20 16:26:51	2024-06-20 16:56:24
7	103	2024-06-20 23:11:56	2024-06-20 23:01:31

This is beyond my pandas skills. I asked ChatGPT, and this is the solution I got, which in principle does what I want:

import pandas as pd

ids = [101, 101, 101, 102, 102, 103, 103, 103]
time_1 = ['2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31', '2024-06-20 16:26:51', '2024-06-20 16:26:51', '2024-06-20 20:05:44', '2024-06-20 22:41:22', '2024-06-20 23:11:56']
time_2 = ['2024-06-20 14:10:31', '2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31', '2024-06-20 16:56:24', '2024-06-20 21:17:35', '2024-06-20 22:21:31', '2024-06-20 23:01:31']


df = pd.DataFrame({
    'id': ids,
    'time_1': pd.to_datetime(time_1),
    'time_2': pd.to_datetime(time_2)
})

grouped = df.groupby('id')['time_1']
mask = pd.Series(False, index=df.index)

for id_value, group in df.groupby('id'):
    # Remove duplicates and sort timestamps
    unique_sorted_times = group['time_1'].drop_duplicates().sort_values()

    # Check if there's more than one unique time
    if len(unique_sorted_times) > 1:
        # Select the second last time
        second_last_time = unique_sorted_times.iloc[-2]
        # Update the mask for rows with time_2 greater than or equal to the second last time_1
        mask |= (df['id'] == id_value) & (df['time_2'] >= second_last_time)
    else:
        # If there's only one unique time, keep the row(s)
        mask |= (df['id'] == id_value)

filtered_data = df[mask]

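My problem with this solution is the for loop. It looks rather inefficient, and my real data is quite large. I'm also curious whether there is a better, more efficient solution.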

Answer 1

Here is a possible solution using groupby. I added an example with a group that contains a single element.

import pandas as pd

ids = [101, 101, 101, 102, 102, 103, 103, 103, 104]
time_1 = [
    '2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31',
    '2024-06-20 16:26:51', '2024-06-20 16:26:51', '2024-06-20 20:05:44',
    '2024-06-20 22:41:22', '2024-06-20 23:11:56', '2024-06-20 23:11:56']
time_2 = [
    '2024-06-20 14:10:31', '2024-06-20 14:32:22', '2024-06-20 15:21:31',
    '2024-06-20 15:21:31', '2024-06-20 16:56:24', '2024-06-20 21:17:35',
    '2024-06-20 22:21:31', '2024-06-20 23:01:31', '2024-06-20 23:01:31']


df = pd.DataFrame({
    'id': ids,
    'time_1': pd.to_datetime(time_1),
    'time_2': pd.to_datetime(time_2)
})

We define a function that handles the logic within each group:

def fun(x):
    # Distinct time_1 values of this group, in ascending order
    unique_times = x['time_1'].drop_duplicates().sort_values()
    # Filter only when a second-latest time_1 exists;
    # groups with a single distinct time_1 are kept whole
    if len(unique_times) >= 2:
        second_last_time = unique_times.iloc[-2]
        x = x[x['time_2'].ge(second_last_time)]
    return x

df.groupby('id').apply(fun).reset_index(drop=True)

    id              time_1              time_2
0  101 2024-06-20 15:21:31 2024-06-20 14:32:22
1  101 2024-06-20 15:21:31 2024-06-20 15:21:31
2  102 2024-06-20 16:26:51 2024-06-20 15:21:31
3  102 2024-06-20 16:26:51 2024-06-20 16:56:24
4  103 2024-06-20 23:11:56 2024-06-20 23:01:31
5  104 2024-06-20 23:11:56 2024-06-20 23:01:31

With this approach, you should see the benefit as your df grows. On a 90,000-row DataFrame, I saw a 25% improvement.
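
If the per-group Python call is still the bottleneck, the same rule can probably be expressed without a loop or apply at all. The following is only a sketch, not a benchmarked solution: it uses the 8-row sample from the question, and the names uniq, pos, second_last, threshold and keep are made up for illustration. The idea is to rank the distinct time_1 values within each id, pick the second-latest one, map it back onto the rows, and keep rows whose id has no second-latest time or whose time_2 is not earlier than it.

```python
import pandas as pd

# Sample data from the question
ids = [101, 101, 101, 102, 102, 103, 103, 103]
time_1 = ['2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31',
          '2024-06-20 16:26:51', '2024-06-20 16:26:51', '2024-06-20 20:05:44',
          '2024-06-20 22:41:22', '2024-06-20 23:11:56']
time_2 = ['2024-06-20 14:10:31', '2024-06-20 14:32:22', '2024-06-20 15:21:31',
          '2024-06-20 15:21:31', '2024-06-20 16:56:24', '2024-06-20 21:17:35',
          '2024-06-20 22:21:31', '2024-06-20 23:01:31']

df = pd.DataFrame({
    'id': ids,
    'time_1': pd.to_datetime(time_1),
    'time_2': pd.to_datetime(time_2),
})

# Distinct (id, time_1) pairs, latest time_1 first within each id
uniq = (df[['id', 'time_1']]
        .drop_duplicates()
        .sort_values(['id', 'time_1'], ascending=[True, False]))

# Position of each distinct time within its id: 0 = latest, 1 = second-latest
pos = uniq.groupby('id').cumcount()

# Second-latest distinct time_1 per id (ids with one distinct time are absent)
second_last = uniq.loc[pos == 1].set_index('id')['time_1']

# Per-row threshold; NaT for ids that have no second-latest time_1
threshold = df['id'].map(second_last)

# Keep rows without a threshold, or with time_2 not earlier than it
keep = threshold.isna() | (df['time_2'] >= threshold)
filtered_data = df[keep]
print(filtered_data)
```

On the sample data this keeps the rows with index 1, 2, 3, 4 and 7, which matches the expected output in the question. Whether it is actually faster on real data depends on the number of groups and their sizes, so it is worth timing against the groupby/apply version.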
