I want to write a dataframe containing more than 300,000 records to a CSV file. I've tried splitting the dataframe into multiple files based on a chunk size; this is what I'm doing:
import numpy as np

chunk_size = 25000
no_of_chunks = len(df) // chunk_size + 1
for i, chunk in enumerate(np.array_split(df, no_of_chunks)):
    chunk.to_csv(f"filename_{i}.csv")
Is there a way to determine the chunk size dynamically and partition the dataframe so that rows sharing a value are never split across CSV files? For example:

col1     col2    .....    coln
apple    ...............
apple    ...............
mango    ...............
mango    ...............
I don't want the apple rows to be split up, with some of them ending up in a new CSV file. Can anyone help?
Here is a way to do this in Pandas:
import pandas as pd

def split_df_to_csv(df, group_column, max_file_size_mb=5):
    # Initialize variables
    current_size = 0
    file_index = 0
    max_size_bytes = max_file_size_mb * 1024 * 1024  # Convert MB to bytes
    temp_df = pd.DataFrame()

    # Iterate over groups based on the specified column
    for key, group in df.groupby(group_column):
        # Estimate the size of the current group when converted to CSV
        temp_csv_size = group.to_csv(index=False).encode('utf-8')
        group_size = len(temp_csv_size)

        # If adding this group would exceed the max size, write temp_df to a CSV file
        if current_size + group_size > max_size_bytes and not temp_df.empty:
            temp_df.to_csv(f"filename_{file_index}.csv", index=False)
            print(f"Written: filename_{file_index}.csv")

            # Reset for the next file
            file_index += 1
            temp_df = pd.DataFrame()
            current_size = 0

        # Add the group to the temp DataFrame
        temp_df = pd.concat([temp_df, group])
        current_size += group_size

    # Write any remaining data in the temp DataFrame to the last CSV file
    if not temp_df.empty:
        temp_df.to_csv(f"filename_{file_index}.csv", index=False)
        print(f"Written: filename_{file_index}.csv")