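This is similar to a question I asked a few days ago, but with some specific caveats. My sincere apologies in advance. I have two files loaded into DataFrames, df1 and df2. They hold different information under different headers, but both share a common "id" header. Essentially, the two files contain different sets of information about the ids they have in common. For example: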
df1:
id | Date | another_col |
---|---|---|
WPA54 | 2023-08-01 | A |
WPA54 | 2023-08-01 | B |
WPA54 | 2023-08-01 | C |
WPA54 | 2023-08-01 | D |
IBT675 | 2023-08-01 | E |
IBT675 | 2023-08-01 | F |
df2:
id | DateTime |
---|---|
WPA54 | 2023-08-01 00:02:52.527 |
WPA54 | 2023-08-01 00:10:10.640 |
WPA54 | 2023-08-01 00:10:12:740 |
WPA54 | 2023-08-01 00:10:26.937 |
IBT675 | 2023-08-01 00:10:10.640 |
IBT675 | 2023-08-01 00:10:11.540 |
IBT675 | 2023-08-01 00:10:12:740 |
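For simplicity, I want to take the DateTime values from df2 and create a new column in df1 called fix_timestamps, where the unique times are joined with a semicolon (';') separator and paired with the correct id in df1. This matters because df1 and df2 have different shapes. df1 is more "fixed" and contains information treated as a point in time. I need to attach df2's information to it, because df2 is a much larger file containing many different times that need to be attached to the ids in df1.
The difference between my previous question and this one is that the previous question was based on a single, already merged file. As I learned more about my data, I realized I was doing it wrong.
I can't take credit for this, since mozway was very helpful with my previous question. It worked very well for one file, but when I use two files...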
import pandas as pd
# Define the file paths for your CSV files
file1_path = 'input1.csv'
file2_path = 'input2.csv'
# Read the CSV files into DataFrames
df1 = pd.read_csv(file1_path)
df2 = pd.read_csv(file2_path)
def append_timestamp(row):
    id_length = len(row['id'])
    timestamps = []
    for i in range(id_length):
        timestamps.append(row['DateTime'])
    return ';'.join(timestamps)
# My thinking of this was to look in my df2 which has the DateTime and group it by the df2 id and create the new column, 'fix_timestamps' in df1 which would have all the DateTime values already appended.
df1['fix_timestamps'] = (df2['DateTime'].astype(str).groupby(df2['id'])
                         .transform(lambda x: ';'.join(x.unique())))
# Save the DataFrame to a CSV file
output_file_path = 'output'
df1.to_csv(output_file_path, index=False)
What I was expecting is something like this:
id | DateTime |
---|---|
WPA54 | 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937 |
WPA54 | 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937 |
WPA54 | 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937 |
WPA54 | 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937 |
IBT675 | 2023-08-01 00:10:10.640;2023-08-01 00:10:11.540;2023-08-01 00:10:12:740 |
IBT675 | 2023-08-01 00:10:10.640;2023-08-01 00:10:11.540;2023-08-01 00:10:12:740 |
IBT675 | 2023-08-01 00:10:10.640;2023-08-01 00:10:11.540;2023-08-01 00:10:12:740 |
But what I get is the same entire appended time series for WPA54, IBT675, and every other id:
id | DateTime |
---|---|
WPA54 | 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937 |
WPA54 | 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937 |
WPA54 | 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937 |
WPA54 | 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937 |
IBT675 | 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937 |
IBT675 | 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937 |
IBT675 | 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937 |
Thank you in advance.
You can solve it like this:
import pandas as pd

# Sample data mirroring the two files in the question
data_df1 = {
    'id': ['WPA54', 'WPA54', 'WPA54', 'WPA54', 'IBT675', 'IBT675'],
    'Date': ['2023-08-01', '2023-08-01', '2023-08-01', '2023-08-01', '2023-08-01', '2023-08-01'],
    'another_col': ['A', 'B', 'C', 'D', 'E', 'F']
}
data_df2 = {
    'id': ['WPA54', 'WPA54', 'WPA54', 'WPA54', 'IBT675', 'IBT675', 'IBT675'],
    'DateTime': ['2023-08-01 00:02:52.527', '2023-08-01 00:10:10.640', '2023-08-01 00:10:12.740',
                 '2023-08-01 00:10:26.937', '2023-08-01 00:10:10.640', '2023-08-01 00:10:11.540',
                 '2023-08-01 00:10:12.740']
}
df1 = pd.DataFrame(data_df1)
df2 = pd.DataFrame(data_df2)

# Build one row per id, with its unique DateTime values joined by ';'
timestamps = df2.groupby('id')['DateTime'].apply(lambda x: ';'.join(x.unique())).reset_index()
timestamps.columns = ['id', 'fix_timestamps']

# Left merge on 'id' so every row of df1 picks up the string for its own id
df1_updated = pd.merge(df1, timestamps, on='id', how='left')
df1_updated
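As for why the assignment in the question misbehaves: `groupby(...).transform(...)` returns a Series indexed like df2, and assigning it to a column of df1 aligns on row labels, not on id. Below is a minimal sketch of that behavior, together with a `Series.map` lookup as a possible alternative to the merge. The toy frames `left` and `right`, their values, and the column name `by_position` are made up for illustration and are not from the question.

```python
import pandas as pd

# Toy frames (hypothetical data): `left` plays the role of df1, `right` of df2.
left = pd.DataFrame({'id': ['A', 'B', 'B']})
right = pd.DataFrame({
    'id': ['A', 'A', 'B'],
    'DateTime': ['t1', 't2', 't3'],
})

# transform() returns one value per row of `right`, indexed 0..2 like `right`.
per_row = right.groupby('id')['DateTime'].transform(lambda s: ';'.join(s.unique()))

# Assigning that Series to `left` aligns on the row index, not on 'id',
# so row 1 of `left` (id 'B') receives the string built for id 'A'.
left['by_position'] = per_row

# One string per id, indexed by id, then looked up through left['id'].
per_id = right.groupby('id')['DateTime'].agg(lambda s: ';'.join(s.unique()))
left['fix_timestamps'] = left['id'].map(per_id)

print(left)
```

With either `map` or a left merge, ids that appear in df1 but not in df2 end up as NaN in the new column.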
# Same steps, overwriting df1 and writing the result to a CSV file
timestamps = df2.groupby('id')['DateTime'].apply(lambda x: ';'.join(x.unique())).reset_index()
timestamps.columns = ['id', 'fix_timestamps']
df1 = pd.merge(df1, timestamps, on='id', how='left')
df1.to_csv(r'C:\Users\s-degossondevarennes/outputdd.csv', index=False)
This returns: