How do I replace multiple values within a single row using a dictionary in PySpark?

Question (votes: 0, answers: 1)

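I have a column named fruits. Every row has the format ["apple","banana","mango"]. I also have a dictionary of the form {oldvalue: newvalue}. I want to replace apple with grapes, and so on for the other entries. How can I do this in PySpark?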

What I tried

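from pyspark.sql import functions as F
from pyspark.sql.functions import col, regexp_replace, when

# Drop null rows, then strip the square brackets from the stringified array
df_silver_3 = (
    df_silver_2.filter(F.col("fruits").isNotNull())
    .withColumn("fruits_cleaned", regexp_replace(df_silver_2["fruits"].cast("string"), r"[\[\]]", ""))
    .select("fruits", "fruits_cleaned")
)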

Replacement dictionary

replace_dict = { "apple": "grapes",   "value2": "replacement2" }

Replace values in the "fruits_cleaned" column using the dictionary

df_replaced = df_silver_3.select(
    [
        when(col("fruits_cleaned") == key, value)
        .otherwise(col("fruits_cleaned"))
        .alias("fruits_cleaned")
        for key, value in replace_dict.items()
    ]
    + [col("fruits")]  # Include original column if needed
)

What I expected

I expected apple to be replaced by grapes:

["apple","banana","mango"] -> ["grapes","banana","mango"]

azure pyspark apache-spark-sql databricks azure-synapse
1 Answer

0 votes

You can use expr together with the SQL function transform to achieve this.
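To make the intent concrete, the per-row operation that transform performs here can be modeled in plain Python. This is only a sketch of the semantics, not Spark code, and replace_elements is a hypothetical helper name introduced for illustration: each array element is looked up in the dictionary, and elements without a match pass through unchanged.

```python
def replace_elements(row, mapping):
    """Plain-Python model of transform(fruits, a -> CASE ... ELSE a END)."""
    # Look up each element in the mapping; unmatched elements pass through.
    return [mapping.get(item, item) for item in row]


row = ["apple", "banana", "mango"]
print(replace_elements(row, {"apple": "grapes"}))
# prints ['grapes', 'banana', 'mango']
```

Note that the comparison happens per element, not against the whole row, which is the difference from comparing the stringified column to a key as in the question's attempt.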

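With expr and transform, build an expression like the following, which replaces each array element with its dictionary value where one exists: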

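from pyspark.sql.functions import expr

my_dict = {"apple": "grapes", "mango": "orange"}

# Build one WHEN ... THEN ... branch per dictionary entry
when_clauses = " ".join(f"WHEN a = '{x}' THEN '{y}'" for x, y in my_dict.items())
# Apply the CASE expression to every element a of the fruits array
req_exp = f"transform(fruits, a -> CASE {when_clauses} ELSE a END)"

res_df = df.withColumn("fruits_cleaned", expr(req_exp))

res_df.show()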

Result:

[Image: output of res_df.show()]
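One caveat with building the expression by string formatting: a key or value containing a single quote would break the generated SQL, and an empty dictionary would yield a CASE with no WHEN branch. The following is a minimal sketch of a more defensive builder. The names sql_quote and build_replace_expr are hypothetical helpers, and the escaping assumes Spark's default string-literal parsing, in which backslash is the escape character.

```python
def sql_quote(value):
    """Render a Python string as a Spark SQL string literal."""
    # Assumption: default Spark literal parsing, where backslash escapes apply,
    # so backslashes and single quotes are both backslash-escaped.
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_replace_expr(column, mapping):
    """Build a transform(...) expression that maps array elements via CASE."""
    if not mapping:
        # CASE needs at least one WHEN branch, so return the column unchanged.
        return column
    branches = " ".join(
        f"WHEN a = {sql_quote(old)} THEN {sql_quote(new)}"
        for old, new in mapping.items()
    )
    return f"transform({column}, a -> CASE {branches} ELSE a END)"


req_exp = build_replace_expr("fruits", {"apple": "grapes", "mango": "orange"})
print(req_exp)
```

The resulting string can be passed to expr exactly as in the answer, for example df.withColumn("fruits_cleaned", expr(req_exp)).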
