Dynamic timestamp appending based on conditional column values from two datasets

Question

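This is similar to a question I asked a few days ago, but with some particular caveats. I sincerely apologize in advance. I have two files loaded into DataFrames, df1 and df2, each with different information and different headers, but both share a common 'id' header. Essentially, the two files contain different sets of information about the ids they share. For example: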

df1:

id	Date	another_col
WPA54	2023-08-01	A
WPA54	2023-08-01	B
WPA54	2023-08-01	C
WPA54	2023-08-01	D
IBT675	2023-08-01	E
IBT675	2023-08-01	F

df2:

id	DateTime
WPA54	2023-08-01 00:02:52.527
WPA54	2023-08-01 00:10:10.640
WPA54	2023-08-01 00:10:12.740
WPA54	2023-08-01 00:10:26.937
IBT675	2023-08-01 00:10:10.640
IBT675	2023-08-01 00:10:11.540
IBT675	2023-08-01 00:10:12.740

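For simplicity, I want to take the DateTime values from df2 and create a new column in df1 called fix_timestamps, where the unique times are joined with a semicolon ';' delimiter and paired with the correct id in df1. This matters because df1 and df2 have different shapes. df1 is more "fixed" and contains information treated as points in time, and I need to attach df2's information to it, because df2 is a much larger file containing many different times that need to be attached to df1's ids.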

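The difference between my previous question and this one is that the previous question was based on a single file produced by a merge, and as I learned more about my data I realized I was doing it wrong.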

I can't take credit for the code below, since mozway was very helpful with my previous question. It works great for one file, but when I use two files...

import pandas as pd

# Define the file paths for your CSV files
file1_path = 'input1.csv'
file2_path = 'input2.csv'

# Read the CSV files into DataFrames
df1 = pd.read_csv(file1_path)
df2 = pd.read_csv(file2_path)

def append_timestamp(row):
    id_length = len(row['id'])
    timestamps = []

    for i in range(id_length):
        timestamps.append(row['DateTime'])
    return ';'.join(timestamps)


# My thinking of this was to look in my df2 which has the DateTime and group it by the df2 id and create the new column, 'fix_timestamps' in df1 which would have all the DateTime values already appended.

df1['fix_timestamps'] = (df2['DateTime'].astype(str).groupby(df2['id'])
                        .transform(lambda x: ';'.join(x.unique())))


# Save the DataFrame to a CSV file
output_file_path = 'output'
df1.to_csv(output_file_path, index=False)

What I expected was something like this:

id	DateTime
WPA54	2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12.740;2023-08-01 00:10:26.937
WPA54	2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12.740;2023-08-01 00:10:26.937
WPA54	2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12.740;2023-08-01 00:10:26.937
WPA54	2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12.740;2023-08-01 00:10:26.937
IBT675	2023-08-01 00:10:10.640;2023-08-01 00:10:11.540;2023-08-01 00:10:12.740
IBT675	2023-08-01 00:10:10.640;2023-08-01 00:10:11.540;2023-08-01 00:10:12.740
IBT675	2023-08-01 00:10:10.640;2023-08-01 00:10:11.540;2023-08-01 00:10:12.740

But what I get is the same entire appended time series for WPA54, IBT675, and every other id:

id	DateTime
WPA54	2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12.740;2023-08-01 00:10:26.937
WPA54	2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12.740;2023-08-01 00:10:26.937
WPA54	2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12.740;2023-08-01 00:10:26.937
WPA54	2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12.740;2023-08-01 00:10:26.937
IBT675	2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12.740;2023-08-01 00:10:26.937
IBT675	2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12.740;2023-08-01 00:10:26.937
IBT675	2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12.740;2023-08-01 00:10:26.937

Thank you in advance.

Answer 1

You can solve it like this:

import pandas as pd

data_df1 = {
    'id': ['WPA54', 'WPA54', 'WPA54', 'WPA54', 'IBT675', 'IBT675'],
    'Date': ['2023-08-01', '2023-08-01', '2023-08-01', '2023-08-01', '2023-08-01', '2023-08-01'],
    'another_col': ['A', 'B', 'C', 'D', 'E', 'F']
}

data_df2 = {
    'id': ['WPA54', 'WPA54', 'WPA54', 'WPA54', 'IBT675', 'IBT675', 'IBT675'],
    'DateTime': ['2023-08-01 00:02:52.527', '2023-08-01 00:10:10.640', '2023-08-01 00:10:12.740', 
                 '2023-08-01 00:10:26.937', '2023-08-01 00:10:10.640', '2023-08-01 00:10:11.540', 
                 '2023-08-01 00:10:12.740']
}

df1 = pd.DataFrame(data_df1)
df2 = pd.DataFrame(data_df2)

# Build one ';'-joined string of the unique DateTime values for each id
timestamps = df2.groupby('id')['DateTime'].apply(lambda x: ';'.join(x.unique())).reset_index()
timestamps.columns = ['id', 'fix_timestamps']

# Left-merge on 'id' so every df1 row receives the string for its own id
df1_updated = pd.merge(df1, timestamps, on='id', how='left')

df1_updated

# The same steps, this time overwriting df1 and saving the result to a CSV file
timestamps = df2.groupby('id')['DateTime'].apply(lambda x: ';'.join(x.unique())).reset_index()
timestamps.columns = ['id', 'fix_timestamps']

df1 = pd.merge(df1, timestamps, on='id', how='left')

df1.to_csv(r'C:\Users\s-degossondevarennes/outputdd.csv', index=False)

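This returns df1 with a new fix_timestamps column that holds, for each row, the ';'-joined unique DateTime values of that row's id.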
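As a side note on why the attempt in the question can misassign values: `transform` returns one value per row of df2, indexed like df2, and assigning that Series to a df1 column lines rows up by index label and not by id. The sketch below reproduces this on a small made-up sample (the data and the names `by_position`, `misaligned`, `per_id` and `aligned` are illustrative assumptions, not from the question) and shows one possible alternative to the merge, using `Series.map` to look up each df1 id.

```python
import pandas as pd

# Small illustrative stand-ins for the question's data (assumed values).
df1 = pd.DataFrame({
    'id': ['WPA54', 'WPA54', 'IBT675'],
    'another_col': ['A', 'B', 'E'],
})
df2 = pd.DataFrame({
    'id': ['WPA54', 'WPA54', 'WPA54', 'IBT675'],
    'DateTime': ['2023-08-01 00:02:52.527', '2023-08-01 00:10:10.640',
                 '2023-08-01 00:10:10.640', '2023-08-01 00:10:11.540'],
})

# The question's approach: transform() yields one string per df2 row,
# indexed like df2, so the assignment aligns on index labels, not on id.
by_position = (df2['DateTime'].astype(str).groupby(df2['id'])
               .transform(lambda x: ';'.join(x.unique())))
misaligned = df1.assign(fix_timestamps=by_position)

# Alternative sketch: build one string per id, then look it up by df1's id.
per_id = df2.groupby('id')['DateTime'].agg(
    lambda x: ';'.join(x.astype(str).unique()))
aligned = df1.assign(fix_timestamps=df1['id'].map(per_id))

print(misaligned)
print(aligned)
```

In `misaligned`, the IBT675 row carries WPA54's timestamps because df2's row 2 belongs to WPA54. In `aligned`, each row gets the string for its own id. Whether `map` or `merge` reads better is largely a matter of taste here; ids missing from df2 end up as NaN in both cases.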
