I have a bunch of data with four date fields per row. Sometimes the first pair of dates repeats along with the ID number, sometimes it doesn't. It looks something like this:
ID,LName,FName,DateIn,DateOut,Days,ODateIn,ODateOut,Odays
1,Doe,Jay,7/14/2023,8/14/2023,31.00,8/15/2023,4/22/2024,251.00
1,Doe,Jay,3/4/2021,11/5/2021,246.00,11/12/2021,12/31/2021,49.00
1,Doe,Jay,7/14/2023,8/14/2023,31.00,5/30/2024,7/2/2024,33.00
1,Doe,Jay,5/8/2022,1/1/2023,238.00,2/28/2023,4/8/2023,39.00
2,Smith,Dude,4/16/2022,6/2/2022,47.00,7/23/2022,9/13/2022,52.00
2,Smith,Dude,12/5/2022,3/14/2023,99.00,8/30/2023,10/11/2023,42.00
2,Smith,Dude,1/3/2024,3/30/2024,87.00,7/18/2024,9/1/2024,45.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,11/12/2020,1/18/2021,67.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,3/27/2021,6/9/2021,74.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,10/4/2021,11/30/2021,57.00
I want to clean it up by combining the rows into a single row per ID number, like this:
ID,DateIn1,DateOut1,Days1,DateIn2,DateOut2,Days2,DateIn3,DateOut3,Days3,ODateIn1,ODateOut1,Odays1,ODateIn2,ODateOut2,Odays2,ODateIn3,ODateOut3,Odays3,ODateIn4,ODateOut4,Odays4
1,3/4/2021,11/5/2021,246.00,5/8/2022,1/1/2023,238.00,7/14/2023,8/14/2023,31.00,11/12/2021,12/31/2021,49.00,2/28/2023,4/8/2023,39.00,8/15/2023,4/22/2024,251.00,5/30/2024,7/2/2024,33.00
2,4/16/2022,6/2/2022,47.00,12/5/2022,3/14/2023,99.00,1/3/2024,3/30/2024,87.00,7/23/2022,9/13/2022,52.00,8/30/2023,10/11/2023,42.00,7/18/2024,9/1/2024,45.00,,,
3,4/6/2020,8/10/2020,126.00,,,,,,,11/12/2020,1/18/2021,67.00,3/27/2021,6/9/2021,74.00,10/4/2021,11/30/2021,57.00,,,
I tried a pivot, but it doesn't work because of the duplicate values in the first group of dates. Does anyone know of something that would make this work?
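(For context on why a plain pivot fails here: pandas' `pivot` requires each index/column pair to be unique, and the repeated rows violate that. A minimal, made-up two-row reproduction, not taken from the data above:)

```python
import pandas as pd

# Two rows carrying the same (ID, occurrence) pair, as happens when
# the first date group repeats verbatim across rows.
df = pd.DataFrame({'ID': [1, 1], 'n': [1, 1],
                   'DateIn': ['7/14/2023', '7/14/2023']})
try:
    df.pivot(index='ID', columns='n', values='DateIn')
    duplicate_ok = True
except ValueError as err:   # pivot refuses to reshape duplicate entries
    duplicate_ok = False
    print(err)
```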
Since you haven't shared any code of your own, I'll assume you don't have anything yet (if you do, great, please post it for feedback) :-(
I put your sample rows into a file (flattenMe.csv) so the result can be checked with pandas afterwards... this can easily be improved.
Code (it assumes rows for the same ID are adjacent, as in your sample):
import csv

key = []           # (ID, LName, FName) fields that identify a group
flattened = []     # the joined row being built for the current key
header = []        # output header, extended as wider rows appear
hdrTemplate = []   # repeating field names (DateIn, DateOut, Days, ...)
setNum = 0         # number of field sets collected for the current key
maxSets = 0        # most field sets seen for any key so far
buffer = []        # finished rows, written out at the end

# NOTE: this relies on rows for the same ID being adjacent;
# sort the file by ID first if they are not.
with open('flattenMe.csv', newline='') as csvfile:
    for row in csv.reader(csvfile):
        if not header:                    # header row, handled once
            hdrTemplate = row[3:]
            header = row[0:3] + [h + '1' for h in hdrTemplate]
            maxSets = 1
            continue
        if row[0:3] != key:               # key change: flush previous group
            if flattened:
                buffer.append(flattened)
            key = row[0:3]
            flattened = list(row)
            setNum = 1
        else:                             # same key: tack the fields on
            flattened.extend(row[3:])
            setNum += 1
            if setNum > maxSets:          # widest group yet: grow the header
                maxSets = setNum
                header.extend(h + str(setNum) for h in hdrTemplate)
    if flattened:                         # flush the last group
        buffer.append(flattened)

#
# write results back to a csv
#
with open('flattened.csv', 'w', newline='') as out:
    flat = csv.writer(out)
    flat.writerow(header)
    flat.writerows(buffer)

#
# read it back with pandas and show some detail
#
import pandas as pd
print(pd.read_csv('flattened.csv'))
Running it produces:
ID LName FName DateIn1 DateOut1 Days1 ODateIn1 ODateOut1 Odays1 ... ODateIn4 ODateOut4 Odays4 DateIn5 DateOut5 Days5 ODateIn5 ODateOut5 Odays5
0 1 Doe Jay 7/14/2023 8/14/2023 31.0 8/15/2023 4/22/2024 251.0 ... 2/28/2023 4/8/2023 39.0 NaN NaN NaN NaN NaN NaN
1 2 Smith Dude 4/16/2022 6/2/2022 47.0 7/23/2022 9/13/2022 52.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 Doe Jane 4/6/2020 8/10/2020 126.0 11/12/2020 1/18/2021 67.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 Pat May 4/11/2020 8/10/2020 126.0 10/4/2021 11/30/2021 57.0 ... 11/12/2020 1/18/2021 67.0 3/9/2021 11/5/2021 246.0 11/12/2021 12/31/2021 49.0
[4 rows x 33 columns]
Feel free to use/butcher/ignore as you see fit. Errors and omissions acknowledged; all criticism gratefully accepted, but I won't be posting anything further on this.
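As an addendum: the pivot the question tried can also be made to work in pandas itself, by deduplicating the repeated left-hand date group first and then numbering each row within its ID (`groupby` + `cumcount`) so the pivot key becomes unique. A sketch under those assumptions, with the sample rows from the question inlined; the `widen` helper is my own name, not anything from the post above:

```python
import io
import pandas as pd

# The question's sample rows, inlined for a self-contained demo.
raw = """ID,LName,FName,DateIn,DateOut,Days,ODateIn,ODateOut,Odays
1,Doe,Jay,7/14/2023,8/14/2023,31.00,8/15/2023,4/22/2024,251.00
1,Doe,Jay,3/4/2021,11/5/2021,246.00,11/12/2021,12/31/2021,49.00
1,Doe,Jay,7/14/2023,8/14/2023,31.00,5/30/2024,7/2/2024,33.00
1,Doe,Jay,5/8/2022,1/1/2023,238.00,2/28/2023,4/8/2023,39.00
2,Smith,Dude,4/16/2022,6/2/2022,47.00,7/23/2022,9/13/2022,52.00
2,Smith,Dude,12/5/2022,3/14/2023,99.00,8/30/2023,10/11/2023,42.00
2,Smith,Dude,1/3/2024,3/30/2024,87.00,7/18/2024,9/1/2024,45.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,11/12/2020,1/18/2021,67.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,3/27/2021,6/9/2021,74.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,10/4/2021,11/30/2021,57.00"""
df = pd.read_csv(io.StringIO(raw))

def widen(part, datecol):
    """Dedupe one date group, sort it chronologically per ID,
    number the occurrences, and pivot them wide."""
    part = part.drop_duplicates().copy()
    part['_d'] = pd.to_datetime(part[datecol], format='%m/%d/%Y')
    part = part.sort_values(['ID', '_d']).drop(columns='_d')
    part['n'] = part.groupby('ID').cumcount() + 1   # unique pivot key
    wide = part.pivot(index='ID', columns='n')
    wide.columns = [f'{name}{n}' for name, n in wide.columns]
    return wide

left = widen(df[['ID', 'DateIn', 'DateOut', 'Days']], 'DateIn')
right = widen(df[['ID', 'ODateIn', 'ODateOut', 'Odays']], 'ODateIn')
result = left.join(right).reset_index()
print(result)
```

Note the columns come out grouped by field name (DateIn1, DateIn2, ..., DateOut1, ...) rather than per set; reorder them afterwards if the per-set layout in the question matters.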