I have a table in this format:

name | fruits | apple | banana | orange |
---|---|---|---|---|
Alice | ["apple", "banana", "orange"] | 5 | 8 | 3 |
Bob | ["apple"] | 2 | 9 | 1 |
I want to create a new column containing a JSON payload in this format, where the keys are the elements of the array and the values come from the columns whose names match those elements:

name | fruits | apple | banana | orange | new_col |
---|---|---|---|---|---|
Alice | ["apple", "banana", "orange"] | 5 | 8 | 3 | {"apple": 5, "banana": 8, "orange": 3} |
Bob | ["apple"] | 2 | 9 | 1 | {"apple": 2} |
Any ideas on how to approach this? I assume it takes a UDF, but I can't get the syntax right. Here is the code I have:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import MapType, StringType

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample data
data = [("Alice", ["apple", "banana", "orange"], 5, 8, 3),
        ("Bob", ["apple"], 2, 9, 1)]

# Define the schema
schema = ["name", "fruits", "apple", "banana", "orange"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Show the initial DataFrame
print("Initial DataFrame:")
display(df)

# Define a UDF to create a dictionary
@udf(MapType(StringType(), StringType()))
def json_map(fruits):
    result = {}
    for i in fruits:
        result[i] = col(i)
    return result

# Apply the UDF to the 'fruits' column
new_df = df.withColumn('test', json_map(col('fruits')))

# Display the updated DataFrame
display(new_df)
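The reason the UDF version fails is that col(i) builds a Column expression for the query planner; inside a UDF you are working with plain Python values, so the row's numbers never reach the function. If you want to stay on the UDF route, one workaround is to pass the values into the UDF explicitly, packed into a map column. A minimal sketch, assuming the set of fruit columns (fruit_cols below) is known up front and the values are integers:

from itertools import chain
from pyspark.sql.functions import udf, col, lit, create_map
from pyspark.sql.types import MapType, StringType, IntegerType

fruit_cols = ["apple", "banana", "orange"]  # assumed to be known up front

# Pack the per-fruit columns into a single map column: {"apple": 5, "banana": 8, ...}
values_map = create_map(*chain.from_iterable((lit(c), col(c)) for c in fruit_cols))

# The UDF now receives plain Python objects: a list and a dict
@udf(MapType(StringType(), IntegerType()))
def json_map(fruits, values):
    return {f: values[f] for f in fruits}

new_df = df.withColumn('new_col', json_map(col('fruits'), values_map))
new_df.show(truncate=False)

That said, built-in functions avoid the UDF serialization overhead entirely.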
First, use arrays_zip to combine the fruits array and the value columns into an array of structs, pairing the elements by position, then filter out the entries whose fruit is null (these appear when a row's array is shorter than the number of value columns). Note that the pairing is positional, so it assumes the fruits array lists the fruits in the same order as the value columns, which holds for the sample data:
from pyspark.sql.functions import arrays_zip, array, col, expr

data = [("Alice", ["apple", "banana", "orange"], 5, 8, 3),
        ("Bob", ["apple"], 2, 9, 1)]
schema = ["name", "fruits", "apple", "banana", "orange"]
df = spark.createDataFrame(data, schema=schema)

df.withColumn("new_col", arrays_zip(col("fruits"), array(col("apple"), col("banana"), col("orange")))) \
  .withColumn("new_col", expr("filter(new_col, x -> x.fruits IS NOT NULL)")) \
  .show(truncate=False)
Result:
+-----+-----------------------+-----+------+------+--------------------------------------+
|name |fruits |apple|banana|orange|new_col |
+-----+-----------------------+-----+------+------+--------------------------------------+
|Alice|[apple, banana, orange]|5 |8 |3 |[{apple, 5}, {banana, 8}, {orange, 3}]|
|Bob |[apple] |2 |9 |1 |[{apple, 2}] |
+-----+-----------------------+-----+------+------+--------------------------------------+
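If new_col needs to be an actual map, or the JSON string shown in the desired output, rather than an array of structs, the filtered array can be converted with the built-ins map_from_entries and to_json (both available for this use since Spark 2.4). A sketch building on the same df:

from pyspark.sql.functions import arrays_zip, array, col, expr, map_from_entries, to_json

df.withColumn("new_col", arrays_zip(col("fruits"), array(col("apple"), col("banana"), col("orange")))) \
  .withColumn("new_col", expr("filter(new_col, x -> x.fruits IS NOT NULL)")) \
  .withColumn("new_col", to_json(map_from_entries(col("new_col")))) \
  .show(truncate=False)

map_from_entries turns the array of (fruit, value) structs into a map column, and to_json then renders it as a string like {"apple":5,"banana":8,"orange":3}.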