Pandas slows down after processing 10,000 rows


I'm working on a small function that uses pandas to do some simple cleanup on a CSV. Here is the code:

import numpy as np
import pandas as pd
from datetime import datetime

def clean_charges(conn, cur):
    charges = pd.read_csv('csv/all_charges.csv', parse_dates=['CreatedDate', 'PostingDate', 
                                                            'PrimaryInsurancePaymentPostingDate', 
                                                            'SecondaryInsurancePaymentPostingDate', 
                                                            'TertiaryInsurancePaymentPostingDate'])
    
    # Split charges into 10 equal sized dataframes
    num_splits = 10
    charges_split = np.array_split(charges, num_splits)
    
    cur_month = datetime.combine(datetime.now().date().replace(day=1), datetime.min.time())

    count = 0
    total = 0
    for cur_charge in charges_split:
        for index, charge in cur_charge.iterrows():
            if total % 1000 == 0:
                print(total)
            total += 1
            # Delete it from the dataframe if its a charge from the current month
            if charge['PostingDate'] >= cur_month:
                count += 1
                charges.drop(index, inplace=True)
                continue
            # Delete the payments if they were applied in the current month
            if charge['PrimaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['PrimaryInsuranceInsurancePayment']
                charge['PrimaryInsurancePayment'] = 0
            if charge['SecondaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['SecondaryInsuranceInsurancePayment']
                charge['SecondaryInsurancePayment'] = 0
            if charge['TertiaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['TertiaryInsuranceInsurancePayment']
                charge['TertiaryInsurancePayment'] = 0
            # Delete duplicate payments
            if charge['AdjustedCharges'] - (charge['PrimaryInsuranceInsurancePayment'] + charge['SecondaryInsuranceInsurancePayment'] + 
                                            charge['TertiaryInsuranceInsurancePayment'] + charge['PatientPaymentAmount']) != charge['TotalBalance']:
                charge['SecondaryInsurancePayment'] = 0

    charges = pd.concat(charges_split)
    
    charges.to_csv('csv/updated_charges.csv', index=False)

all_charges.csv is about 270,000 rows in total, but I'm running into a problem where it gets through the first 10,000 rows very quickly and then slows down. The first 10,000 take roughly 5 seconds; every 10,000 after that take about 2 minutes. This happens both when I process everything as a single dataframe and when I split it into 10 dataframes (as you can see in my code now). I can't see anything that would cause this. My code may not be 100% optimized, but I don't think I'm doing anything horribly stupid. My machine is also only at 15% CPU and 40% memory usage, so I don't believe it's a hardware problem.

Any help figuring out why this runs so slowly would be greatly appreciated!

python pandas
1 Answer

Dropping records from a dataframe is reported to be slow: each drop(..., inplace=True) rewrites the remaining rows, so calling it once per row inside a loop is effectively quadratic in the number of rows. It is better to use pandas filtering instead.

Generating a 70,000-record CSV and processing only the first 10,000 rows with the question's approach:

import pandas as pd
from datetime import datetime

def clean_charges(charges):

    flt_date = datetime(2024, 9, 1)

    count = 0
    total = 0
    for index, charge in charges.iterrows():
        if total % 1000 == 0:
            print(total)
        total += 1
        # Delete it from the dataframe if its a charge from the current month
        if charge['PostingDate'] >= flt_date:
            count += 1
            charges.drop(index, inplace=True)
            continue
        if total == 10000:
            break

charges = pd.read_csv('faker_data_70000.csv', parse_dates=['PostingDate'])
print(f'df length: {len(charges.index)}')
clean_charges(charges)
print(f'df length: {len(charges.index)}')

Running it:

time python filter.py

Result:

df length: 70000
0
1000
2000
...
9000
df length: 66694

real    0m40.134s
user    0m40.555s
sys     0m0.096s
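
The generation of the test file is not shown in the answer. The filename suggests the Faker library was used, but for a benchmark that only reads PostingDate, a plain numpy/pandas sketch is enough to produce a compatible faker_data_70000.csv (the single-column layout is an assumption):

# Hypothetical generator for the benchmark input. Only PostingDate is
# included, since it is the only column the benchmark above touches.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 70,000 timestamps spread uniformly across 2024
dates = pd.Timestamp('2024-01-01') + pd.to_timedelta(
    rng.integers(0, 365, size=70_000), unit='D')
pd.DataFrame({'PostingDate': dates}).to_csv('faker_data_70000.csv', index=False)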

Using pandas filtering:

import pandas as pd
from datetime import datetime

charges = pd.read_csv('faker_data_70000.csv', parse_dates=['PostingDate'])
print(f'df length: {len(charges.index)}')

flt_date = datetime(2024, 9, 1)
charges_flt = charges[charges['PostingDate'] > flt_date]
print(f'df length: {len(charges_flt.index)}')

Result:

df length: 70000
df length: 23092

real    0m0.534s
user    0m1.018s
sys     0m0.040s
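
The same idea extends to the full logic in the question. Below is a sketch, not a tested drop-in: it assumes the column names exactly as they appear in the question (including the doubled InsuranceInsurance ones), and it keeps rows posted before the current month, so the mask uses < where the timing demo above used >. (Incidentally, the original code drops rows from charges but then rebuilds charges with pd.concat(charges_split), whose pieces still hold the dropped rows, so those drops never reached updated_charges.csv anyway.)

import pandas as pd
from datetime import datetime

def clean_charges_vectorized(charges):
    cur_month = datetime.combine(datetime.now().date().replace(day=1), datetime.min.time())

    # Keep only charges posted before the current month; one vectorized
    # pass replaces 270,000 calls to drop()
    charges = charges[charges['PostingDate'] < cur_month].copy()

    # Reverse payments that were posted in the current month, one payer at a time
    for payer in ('Primary', 'Secondary', 'Tertiary'):
        posted = charges[payer + 'InsurancePaymentPostingDate'] >= cur_month
        charges.loc[posted, 'TotalBalance'] += charges.loc[posted, payer + 'InsuranceInsurancePayment']
        charges.loc[posted, payer + 'InsurancePayment'] = 0

    # Zero the secondary payment wherever the balance check fails
    paid = (charges['PrimaryInsuranceInsurancePayment']
            + charges['SecondaryInsuranceInsurancePayment']
            + charges['TertiaryInsuranceInsurancePayment']
            + charges['PatientPaymentAmount'])
    mismatch = (charges['AdjustedCharges'] - paid) != charges['TotalBalance']
    charges.loc[mismatch, 'SecondaryInsurancePayment'] = 0

    return charges

Each step here is a single vectorized operation, so the full 270,000-row file should clean in seconds rather than slowing down as the loop progresses.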