I'm working on a small function that does some simple cleanup of a CSV using pandas. Here's the code:
import pandas as pd
import numpy as np
from datetime import datetime

def clean_charges(conn, cur):
    charges = pd.read_csv('csv/all_charges.csv', parse_dates=['CreatedDate', 'PostingDate',
                                                              'PrimaryInsurancePaymentPostingDate',
                                                              'SecondaryInsurancePaymentPostingDate',
                                                              'TertiaryInsurancePaymentPostingDate'])
    # Split charges into 10 equal sized dataframes
    num_splits = 10
    charges_split = np.array_split(charges, num_splits)
    cur_month = datetime.combine(datetime.now().date().replace(day=1), datetime.min.time())
    count = 0
    total = 0
    for cur_charge in charges_split:
        for index, charge in cur_charge.iterrows():
            if total % 1000 == 0:
                print(total)
            total += 1
            # Delete it from the dataframe if it's a charge from the current month
            if charge['PostingDate'] >= cur_month:
                count += 1
                charges.drop(index, inplace=True)
                continue
            # Delete the payments if they were applied in the current month
            if charge['PrimaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['PrimaryInsuranceInsurancePayment']
                charge['PrimaryInsurancePayment'] = 0
            if charge['SecondaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['SecondaryInsuranceInsurancePayment']
                charge['SecondaryInsurancePayment'] = 0
            if charge['TertiaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['TertiaryInsuranceInsurancePayment']
                charge['TertiaryInsurancePayment'] = 0
            # Delete duplicate payments
            if charge['AdjustedCharges'] - (charge['PrimaryInsuranceInsurancePayment'] + charge['SecondaryInsuranceInsurancePayment'] +
                                            charge['TertiaryInsuranceInsurancePayment'] + charge['PatientPaymentAmount']) != charge['TotalBalance']:
                charge['SecondaryInsurancePayment'] = 0
    charges = pd.concat(charges_split)
    charges.to_csv('csv/updated_charges.csv', index=False)
all_charges.csv is about 270,000 rows in total, but I'm running into a problem where the first 10,000 rows are processed very quickly and then everything slows down. The first 10,000 take roughly 5 seconds; after that, every 10,000 rows take about 2 minutes. This happens both when I process everything as a single dataframe and when I split it into 10 dataframes (as you can see in my code now). I can't see anything that would cause this; my code probably isn't 100% optimized, but I don't feel like I'm doing anything terribly stupid. My machine is also only running at 15% CPU and 40% memory usage, so I don't think it's a hardware problem.
Any help figuring out why this runs so slowly would be greatly appreciated!
Deleting records from a dataframe one at a time is known to be slow, so it's better to use pandas filtering instead.
To demonstrate, generate a 70,000-record CSV and process only the first 10,000 rows:
import pandas as pd
from datetime import datetime

def clean_charges(charges):
    flt_date = datetime(2024, 9, 1)
    count = 0
    total = 0
    # for cur_charge in charges_split:
    for index, charge in charges.iterrows():
        if total % 1000 == 0:
            print(total)
        total += 1
        # Delete it from the dataframe if it's a charge from the current month
        if charge['PostingDate'] >= flt_date:
            count += 1
            charges.drop(index, inplace=True)
            continue
        if total == 10000:
            break
charges = pd.read_csv('faker_data_70000.csv', parse_dates=['PostingDate'])
print(f'df length: {len(charges.index)}')
clean_charges(charges)
print(f'df length: {len(charges.index)}')
Run it:
time python filter.py
The result:
df length: 70000
0
1000
2000
...
9000
df length: 66694
real 0m40.134s
user 0m40.555s
sys 0m0.096s
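Roughly 40 seconds for just the first 10,000 rows, and it gets worse as you go: every charges.drop(index, inplace=True) call re-indexes and copies the remaining rows, so dropping k rows one at a time from an n-row frame costs on the order of n*k row copies. If you want to keep the explicit loop for some reason, a minimal sketch of an intermediate fix (assuming the same charges frame and flt_date as above) is to collect the labels first and drop them in a single call:

# Sketch: same loop, but defer the drop to one batched call.
# Assumes `charges` and `flt_date` are defined as in the snippet above.
to_drop = []
for index, charge in charges.iterrows():
    if charge['PostingDate'] >= flt_date:
        to_drop.append(index)
charges.drop(to_drop, inplace=True)  # one O(n) rebuild instead of one per dropped row

That removes the quadratic behavior, but iterrows() itself is still slow; the vectorized filter below is faster still.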
Using pandas filtering instead (note this keeps the rows with PostingDate > flt_date; to match the original intent of dropping current-month charges you would keep PostingDate < flt_date, but the timing is the same either way):
charges = pd.read_csv('faker_data_70000.csv', parse_dates=['PostingDate'])
print(f'df length: {len(charges.index)}')
flt_date = datetime(2024, 9, 1)
charges_flt = charges[charges['PostingDate'] > flt_date]
print(f'df length: {len(charges_flt.index)}')
The result:
df length: 70000
df length: 23092
real 0m0.534s
user 0m1.018s
sys 0m0.040s
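The same idea extends to the payment-reversal logic in your full function. Note also that assigning to charge inside iterrows() only mutates a temporary Series, not the dataframe, so those updates are silently lost in your original code. A minimal vectorized sketch, assuming the column names exactly as they appear in your snippet (including the doubled 'InsuranceInsurance' spellings) and a hypothetical function name:

import pandas as pd
from datetime import datetime

def clean_charges_vectorized(charges):
    cur_month = datetime.combine(datetime.now().date().replace(day=1), datetime.min.time())

    # Drop current-month charges in one vectorized pass instead of row by row.
    charges = charges[charges['PostingDate'] < cur_month].copy()

    # Reverse payments posted in the current month, one insurance level at a time.
    for level in ['Primary', 'Secondary', 'Tertiary']:
        mask = charges[f'{level}InsurancePaymentPostingDate'] >= cur_month
        charges.loc[mask, 'TotalBalance'] += charges.loc[mask, f'{level}InsuranceInsurancePayment']
        charges.loc[mask, f'{level}InsurancePayment'] = 0

    # Zero the secondary payment where the balance doesn't reconcile.
    paid = (charges['PrimaryInsuranceInsurancePayment']
            + charges['SecondaryInsuranceInsurancePayment']
            + charges['TertiaryInsuranceInsurancePayment']
            + charges['PatientPaymentAmount'])
    bad = (charges['AdjustedCharges'] - paid) != charges['TotalBalance']
    charges.loc[bad, 'SecondaryInsurancePayment'] = 0

    return charges

Call it with the frame you already load (charges = clean_charges_vectorized(charges)) and write the result out as before. Every step is a single pass over the frame, so all 270,000 rows should finish in roughly the same sub-second range as the filter benchmark above rather than in hours.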