Combining data with multiple dates into one row for each case


I have a bunch of data with 4 date fields in each row. Sometimes the first set of dates repeats for an ID number, sometimes it doesn't. It looks something like this:

ID,LName,FName,DateIn,DateOut,Days,ODateIn,ODateOut,Odays
1,Doe,Jay,7/14/2023,8/14/2023,31.00,8/15/2023,4/22/2024,251.00
1,Doe,Jay,3/4/2021,11/5/2021,246.00,11/12/2021,12/31/2021,49.00
1,Doe,Jay,7/14/2023,8/14/2023,31.00,5/30/2024,7/2/2024,33.00
1,Doe,Jay,5/8/2022,1/1/2023,238.00,2/28/2023,4/8/2023,39.00
2,Smith,Dude,4/16/2022,6/2/2022,47.00,7/23/2022,9/13/2022,52.00
2,Smith,Dude,12/5/2022,3/14/2023,99.00,8/30/2023,10/11/2023,42.00
2,Smith,Dude,1/3/2024,3/30/2024,87.00,7/18/2024,9/1/2024,45.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,11/12/2020,1/18/2021,67.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,3/27/2021,6/9/2021,74.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,10/4/2021,11/30/2021,57.00

I want to clean it up by combining the rows into one row per ID number, so it looks like this:

ID,DateIn1,DateOut1,Days1,DateIn2,DateOut2,Days2,DateIn3,DateOut3,Days3,ODateIn1,ODateOut1,Days1,ODateIn2,ODateOut2,Days2,ODateIn3,ODateOut3,Days3,ODateIn4,ODateOut4,Days4
1,3/4/2021,11/5/2021,246.00,5/8/2022,1/1/2023,238.00,7/14/2023,8/14/2023,31.00,11/12/2021,12/31/2021,49.00,2/28/2023,4/8/2023,39.00,8/15/2023,4/22/2024,251.00,5/30/2024,7/2/2024,33.00
2,4/16/2022,6/2/2022,47.00,12/5/2022,3/14/2023,99.00,1/3/2024,3/30/2024,87.00,7/23/2022,9/13/2022,52.00,8/30/2023,10/11/2023,42.00,7/18/2024,9/1/2024,45.00,,,
3,4/6/2020,8/10/2020,126.00,,,,,,,11/12/2020,1/18/2021,67.00,3/27/2021,6/9/2021,74.00,10/4/2021,11/30/2021,57.00,,,

I tried a pivot, but it doesn't work because there are duplicate values in the first two sets of dates. Does anyone know a way to make this work?

python pandas dataframe merge
1 Answer

Since you haven't shared any code of your own, I'm assuming you have nothing yet (if you do, great, please post it as feedback) :-(

The script below joins your sample rows to make them easy to feed to pandas... it could easily be improved.

Pseudocode

  • read the input csv, appending records that share a key onto a buffer array
  • scan the buffer and pad "short" records with empty/null fields
  • write the buffer out to a "flattened" csv
  • load the flattened csv into a pandas dataframe (for testing)
  • display the result; end
import csv

key = []        # (ID, LName, FName) of the group currently being joined
flattened = []  # the joined row being built for that key
buffer = []     # finished joined rows, written out at the end
maxSets = 1     # widest group seen so far; drives how many column sets the header needs

with open('flattenMe.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')

    rawHeader = next(reader)
    header = rawHeader[0:3]      # ID, LName, FName appear once
    hdrTemplate = rawHeader[3:]  # repeating field names, numbered per set when joined
    header += [h + '1' for h in hdrTemplate]

    for row in reader:
        thiskey = row[0:3]
        if thiskey != key:             # key change: flush the previous group
            if flattened:
                buffer.append(flattened)
            flattened = row
            key = thiskey
            setNum = 1
        else:                          # same key: append this row's date fields
            flattened += row[3:]
            setNum += 1
            if setNum > maxSets:       # widest group so far, so grow the header
                maxSets = setNum
                header += [h + str(setNum) for h in hdrTemplate]

if flattened:
    buffer.append(flattened)           # don't forget the last group

# pad "short" records with empty fields so every row matches the header
for r in buffer:
    r += [''] * (len(header) - len(r))

#
# write results back to a csv
#
with open('flattened.csv', 'w', newline='') as out:
    writer = csv.writer(out, dialect='unix', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(header)
    writer.writerows(buffer)

#
# load the flattened csv into a pandas dataframe (for testing)
#
import pandas as pd
acsv = pd.read_csv('flattened.csv')

#
# show some detail
#
print(acsv)

Running it produces:

   ID  LName FName    DateIn1   DateOut1  Days1    ODateIn1   ODateOut1  Odays1  ...    ODateIn4  ODateOut4  Odays4   DateIn5   DateOut5  Days5    ODateIn5   ODateOut5  Odays5
0   1    Doe   Jay  7/14/2023  8/14/2023   31.0   8/15/2023   4/22/2024   251.0  ...   2/28/2023   4/8/2023    39.0       NaN        NaN    NaN         NaN         NaN     NaN
1   2  Smith  Dude  4/16/2022   6/2/2022   47.0   7/23/2022   9/13/2022    52.0  ...         NaN        NaN     NaN       NaN        NaN    NaN         NaN         NaN     NaN
2   3    Doe  Jane   4/6/2020  8/10/2020  126.0  11/12/2020   1/18/2021    67.0  ...         NaN        NaN     NaN       NaN        NaN    NaN         NaN         NaN     NaN
3   4    Pat   May  4/11/2020  8/10/2020  126.0   10/4/2021  11/30/2021    57.0  ...  11/12/2020  1/18/2021    67.0  3/9/2021  11/5/2021  246.0  11/12/2021  12/31/2021    49.0

[4 rows x 33 columns]
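If you'd rather stay inside pandas, the same reshape can be done without the csv round-trip: number each row's repeat within its ID with groupby/cumcount, then pivot on that counter, so the duplicated first date set no longer breaks the pivot. A sketch using the sample rows from the question (column names as given there; numbering follows input order, so sort by parsed dates first if you want chronological sets):

```python
from io import StringIO

import pandas as pd

raw = StringIO("""ID,LName,FName,DateIn,DateOut,Days,ODateIn,ODateOut,Odays
1,Doe,Jay,7/14/2023,8/14/2023,31.00,8/15/2023,4/22/2024,251.00
1,Doe,Jay,3/4/2021,11/5/2021,246.00,11/12/2021,12/31/2021,49.00
1,Doe,Jay,7/14/2023,8/14/2023,31.00,5/30/2024,7/2/2024,33.00
1,Doe,Jay,5/8/2022,1/1/2023,238.00,2/28/2023,4/8/2023,39.00
2,Smith,Dude,4/16/2022,6/2/2022,47.00,7/23/2022,9/13/2022,52.00
2,Smith,Dude,12/5/2022,3/14/2023,99.00,8/30/2023,10/11/2023,42.00
2,Smith,Dude,1/3/2024,3/30/2024,87.00,7/18/2024,9/1/2024,45.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,11/12/2020,1/18/2021,67.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,3/27/2021,6/9/2021,74.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,10/4/2021,11/30/2021,57.00""")
df = pd.read_csv(raw)

def widen(part, cols):
    """Drop duplicate sets, number the survivors within each ID, pivot wide."""
    part = part.drop_duplicates().copy()             # repeated date sets collapse to one
    part['n'] = part.groupby('ID').cumcount() + 1    # 1st, 2nd, ... set per ID
    wide = part.pivot(index='ID', columns='n', values=cols)
    wide = wide.sort_index(axis=1, level=1)          # group columns by set number
    wide.columns = [f'{c}{n}' for c, n in wide.columns]
    return wide

left = widen(df[['ID', 'DateIn', 'DateOut', 'Days']], ['DateIn', 'DateOut', 'Days'])
right = widen(df[['ID', 'ODateIn', 'ODateOut', 'Odays']], ['ODateIn', 'ODateOut', 'Odays'])
out = left.join(right).reset_index()  # IDs with fewer sets get NaN in the extra columns
print(out)
```

The two halves are widened separately because they deduplicate independently: drop_duplicates collapses the repeated first date set before numbering, while the (ID, n) pair makes every pivot cell unique, which is exactly what the plain pivot was missing.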

Feel free to use / butcher / ignore as appropriate. E&O acknowledged, and all criticism gratefully accepted, but I won't be posting anything further on this.
