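I have a column named fruits. Each row has this format:
["apple","banana","mango"]
I also have a dictionary of the form {oldvalue: newvalue}. I want to replace apple with grapes, and so on for the other keys. How can I do this in PySpark?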
df_silver_3 = (
    df_silver_2.filter(F.col("fruits").isNotNull())
    .withColumn("fruits_cleaned", regexp_replace(df_silver_2["fruits"].cast("string"), r"[\[\]]", ""))
    .select("fruits", "fruits_cleaned")
)
replace_dict = {"apple": "grapes", "value2": "replacement2"}
df_replaced = df_silver_3.select(
    [
        when(col("fruits_cleaned") == key, value).otherwise(col("fruits_cleaned")).alias("fruits_cleaned")
        for key, value in replace_dict.items()
    ]
    + [col("fruits")]  # Include original column if needed
)
I expected apple to be replaced with grapes:
["apple","banana","mango"] -> ["grapes","banana","mango"]
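You can use expr together with the SQL function transform to achieve this, assuming fruits is an array column.
Use them to build an expression like the one below, which replaces each array element that has an entry in the dictionary and leaves the others unchanged: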
from pyspark.sql.functions import expr
my_dict = {"apple": "grapes", "mango": "orange"}
req_exp = "transform(fruits, a -> CASE " + " ".join([f"WHEN a = '{x}' THEN '{y}'" for x, y in my_dict.items()]) + " ELSE a END)"
res_df = df.withColumn("fruits_cleaned", expr(req_exp))
res_df.show()
Result:
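As a hedged variation on the expression above, here is a minimal sketch that wraps the string building in a helper. The name build_replace_expr is hypothetical and not part of PySpark. The sketch assumes Spark's default string-literal parsing, where a backslash escapes a single quote, and it returns the bare column name for an empty dictionary because a CASE with no WHEN branch is not valid SQL.

```python
from typing import Mapping


def build_replace_expr(column: str, mapping: Mapping[str, str]) -> str:
    """Build a Spark SQL transform() expression that swaps array elements.

    Hypothetical helper: elements equal to a key of `mapping` become the
    matching value, every other element is kept as-is.
    """
    if not mapping:
        # CASE needs at least one WHEN branch, so return the column untouched.
        return column

    def quote(value: str) -> str:
        # Assumption: default Spark SQL literals, where backslash is the
        # escape character. Escape backslashes first, then single quotes.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"

    branches = " ".join(
        f"WHEN a = {quote(old)} THEN {quote(new)}" for old, new in mapping.items()
    )
    return f"transform({column}, a -> CASE {branches} ELSE a END)"


my_dict = {"apple": "grapes", "mango": "orange"}
req_exp = build_replace_expr("fruits", my_dict)
```

The resulting req_exp string can then be passed to expr exactly as in the answer, for example df.withColumn("fruits_cleaned", expr(req_exp)). The escaping only matters when a key or value contains a quote or backslash; for plain words the string is the same as the hand-built one.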