I'm trying to join two DataFrames that both contain the columns named below. What is the best way to do a LEFT OUTER join?
df = df.join(df_forecast, ["D_ACCOUNTS_ID", "D_APPS_ID", "D_CONTENT_PAGE_ID"], 'left')
At the moment I get this error:
You're trying to access a column, but multiple columns have that name.
What am I missing?
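Let me know what you think of this. Joining on a list of column names already keeps a single copy of each key, so the error most likely comes from non-key columns that share a name in both DataFrames. One way around it is to select each of those columns through its parent DataFrame and give it a unique alias: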
join_keys = ["D_ACCOUNTS_ID", "D_APPS_ID", "D_CONTENT_PAGE_ID"]

df = (
    df
    .join(df_forecast, join_keys, 'left')
    .select(
        # The join keys appear only once after a join on a list of names.
        *join_keys,
        # df[name] is already a Column, so it must not be wrapped in f.col(),
        # which expects a column name string.
        *[df[element].alias('df_' + element) for element in df.columns if element not in join_keys],
        *[df_forecast[element].alias('df_forecast_' + element) for element in df_forecast.columns if element not in join_keys],
    )
)
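
An alternative that may be easier to read is to rename the non-key columns before the join, so no ambiguous names exist afterwards. The following is only a sketch: rename_plan and prefix_non_keys are hypothetical helper names, not part of PySpark, and "VIEWS" is a made-up column used for illustration. The only DataFrame members it assumes are the columns attribute and withColumnRenamed(existing, new).

```python
from functools import reduce


def rename_plan(columns, join_keys, prefix):
    """Return (old, new) name pairs for every column that is not a join key."""
    keys = set(join_keys)
    return [(name, prefix + name) for name in columns if name not in keys]


def prefix_non_keys(frame, join_keys, prefix):
    """Apply the rename plan to a DataFrame-like object, one column at a time."""
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        rename_plan(frame.columns, join_keys, prefix),
        frame,
    )


join_keys = ["D_ACCOUNTS_ID", "D_APPS_ID", "D_CONTENT_PAGE_ID"]

# "VIEWS" is a hypothetical non-key column, used only to show the naming.
plan = rename_plan(join_keys + ["VIEWS"], join_keys, "df_forecast_")
print(plan)  # [('VIEWS', 'df_forecast_VIEWS')]
```

With real DataFrames, the join would then look something like prefix_non_keys(df, join_keys, 'df_').join(prefix_non_keys(df_forecast, join_keys, 'df_forecast_'), join_keys, 'left'), and no select with aliases is needed afterwards.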