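If the default partition byte size is 128MB, then as I understand it, it is not possible to write Parquet files of 600MB. If I do not change the partition byte size, how can I make sure there are no small files in the DataLake when using coalesce?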
df.count() * len(df.columns)  # row count times column count: a cell count, not a byte size
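# Step 1: Calculate the number of partitions you want
# Note: df.columns is a list, so use len(df.columns). The product is a cell count,
# so dividing by 600MB treats every cell as roughly one byte (a crude assumption).
num_partitions = max(1, (df.count() * len(df.columns)) // (600 * 1024 * 1024))  # Assuming 600MB files
# Step 2: Coalesce to reduce the number of partitions (coalesce needs a positive count)
df_coalesced = df.coalesce(num_partitions)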
# Rough estimate: character length of the first row's string form, not its size on disk
sample_row_size = len(str(df.head(1)))
# Total size of the DataFrame in bytes (approximate)
df_size_in_bytes = df.count() * sample_row_size
# Calculate the number of partitions based on the estimated byte size (at least 1)
num_partitions = max(1, df_size_in_bytes // (600 * 1024 * 1024))
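On the 128MB question: as far as I know, `spark.sql.files.maxPartitionBytes` only controls how input files are split into partitions when reading, so it does not by itself cap the size of the Parquet files that get written. Each output file comes from one partition at write time. Below is a minimal sketch of the partition-count arithmetic, assuming you already have some estimate of the total size in bytes. The names `partitions_for_target` and `write_with_target_size` are hypothetical helpers, not Spark APIs, and Parquet compression means the files on disk will usually be smaller than the estimate.

```python
TARGET_FILE_BYTES = 600 * 1024 * 1024  # 600MB target, as in the question


def partitions_for_target(total_bytes, target_bytes=TARGET_FILE_BYTES):
    """Return how many partitions keep each one at or below target_bytes.

    Uses ceiling division and never returns less than 1, because
    coalesce/repartition require a positive partition count.
    """
    if total_bytes <= 0:
        return 1
    return max(1, (total_bytes + target_bytes - 1) // target_bytes)


def write_with_target_size(df, path, estimated_bytes):
    """Hypothetical helper: df is a PySpark DataFrame, path an output location.

    estimated_bytes is whatever size estimate you trust (an assumption here).
    """
    n = partitions_for_target(estimated_bytes)
    current = df.rdd.getNumPartitions()
    # coalesce can only lower the partition count; repartition can raise it
    out = df.coalesce(n) if n <= current else df.repartition(n)
    out.write.mode("overwrite").parquet(path)
    return n
```

For example, an estimated 3GB of data gives `partitions_for_target(3 * 1024**3)`, which is 6 partitions and therefore roughly 6 output files.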