Databricks PySpark: writing Delta format in overwrite mode does not work as expected

Question

I have the following code.

There is an existing Delta table at my_path with 180 columns. I select a single column and try to overwrite the table:

    columns_to_select = ["one_column"]
    df_one_column = df.select(*columns_to_select)
    # Overwrite the existing table with a single-column DataFrame
    df_one_column.write.format("delta").mode("overwrite").option("mergeSchema", "true").save(my_path)

    # Read the table back and inspect its schema
    new_schema = spark.read.format("delta").load(my_path).schema
    target_column = [field.name for field in new_schema.fields]
    print(len(target_column))  # returns 180

I expected this to return 1, since I selected only one column from the DataFrame, but 180 columns were returned.

databricks delta-lake
1 Answer

mergeSchema only adds new columns to an existing table's schema; it never drops columns, so your overwrite kept all 180 original columns. To replace the schema entirely, you need to use the following option when writing:

option("overwriteSchema", "True")

Here is a complete example:

# Baseline: write the full DataFrame and read it back
df.write.format("delta").mode("overwrite").save(my_path)
df_first = spark.read.format("delta").load(my_path)
print(df_first.columns, len(df_first.columns))

Now select a single column and overwrite the table with overwriteSchema enabled:

columns_to_select = ["firstname"]
df_one_column = df.select(*columns_to_select)
# overwriteSchema replaces the table schema along with the data;
# mergeSchema is not needed here
df_one_column.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(my_path)
df_second = spark.read.format("delta").load(my_path)
print(df_second.columns, len(df_second.columns))

The second read now returns only the single selected column (['firstname'], 1), because overwriteSchema replaced the table's schema together with its data.
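
If you want to confirm what each write did, you can also inspect the table history; a minimal sketch, assuming the delta-spark Python package is installed (on Databricks, DeltaTable is available out of the box):

from delta.tables import DeltaTable

# Every overwrite creates a new table version; operationParameters records
# the write mode and options used
history = DeltaTable.forPath(spark, my_path).history()
history.select("version", "operation", "operationParameters").show(truncate=False)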

See the following links for more information:

mergeSchema: https://delta.io/blog/2023-02-08-delta-lake-schema-evolution/

overwriteSchema: https://docs.databricks.com/en/delta/update-schema.html#explicitly-update-schema-to-change-column-type-or-name
