Writing Parquet files larger than the default partition size

Question

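If the default partition byte size is 128MB, then as I understand it, it is not possible to write a 600MB Parquet file. If I don't change the partition byte size, how can I use coalesce to make sure there are no small files in the data lake?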

Answer 1

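You can work out a suitable number of partitions for coalesce as shown below. You can use

df.count() * len(df.columns)

to get an approximate size of the data. Note that this is a count of cells rather than bytes, so treat it as a coarse proxy.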

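# Step 1: Calculate the number of partitions you want
# df.columns is a list, so take len(); rows * columns is a cell count, only a rough stand-in for bytes
approx_size = df.count() * len(df.columns)
num_partitions = max(1, approx_size // (600 * 1024 * 1024))  # Assuming 600MB files; coalesce needs at least 1

# Step 2: Coalesce to reduce the number of partitions
df_coalesced = df.coalesce(num_partitions)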
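
Floor division rounds down, so the last partial file's worth of data gets folded into the other partitions, and a small DataFrame can come out as zero partitions unless that case is guarded. A minimal sketch of a helper that rounds up instead; the name `partitions_for_size` is hypothetical, and the byte estimate is assumed to come from whichever estimation method you prefer:

```python
def partitions_for_size(estimated_bytes: int,
                        target_file_bytes: int = 600 * 1024 * 1024) -> int:
    """Return a partition count so each partition holds at most about target_file_bytes.

    Hypothetical helper for illustration; it is not part of the PySpark API.
    """
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Integer ceiling division: round up so no partition exceeds the target.
    needed = (estimated_bytes + target_file_bytes - 1) // target_file_bytes
    # coalesce() requires a positive partition count, so never go below 1.
    return max(1, needed)
```

It could then be used as `df.coalesce(partitions_for_size(size_estimate))`. Keep in mind that `coalesce` only reduces the partition count; if the result is larger than the current number of partitions, the DataFrame keeps its current partitioning.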

For better accuracy, you can also estimate the size of the data in bytes as follows:

sample_row_size = len(str(df.head(1)))  # Rough size of the first row: length of its string form (assumes about 1 byte per character)

# Total size of the DataFrame in bytes
df_size_in_bytes = df.count() * sample_row_size

# Calculate the number of partitions based on the estimated byte size
num_partitions = max(1, df_size_in_bytes // (600 * 1024 * 1024))  # coalesce needs at least 1
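
A single row can be unrepresentative, so averaging over a small sample may give a steadier estimate. A rough sketch, assuming the sample rows have already been collected to the driver (for example with `df.limit(100).collect()`); `average_row_bytes` is a hypothetical helper, and the string length remains only a proxy for the in-memory size, not for the compressed Parquet size on disk:

```python
def average_row_bytes(rows) -> int:
    """Average UTF-8 length of each row's string form, as a rough per-row size.

    Hypothetical helper for illustration; `rows` is any iterable of row objects.
    """
    sizes = [len(str(row).encode("utf-8")) for row in rows]
    if not sizes:
        # No sample rows: report 0 rather than dividing by zero.
        return 0
    return sum(sizes) // len(sizes)
```

The total would then be estimated as `df.count() * average_row_bytes(df.limit(100).collect())`, where the sample size of 100 is an arbitrary choice.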
