Custom melt over a list of items in PySpark

Question · Votes: 0 · Answers: 1

I want to melt my PySpark DataFrame in a custom way. My DataFrame looks like this:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("CustomMeltExample") \
    .getOrCreate()

# Sample DataFrame
data = [
    ("John", 25, 80, 90, 85, 75, 85, 88, 70, 75),
    ("Alice", 30, 85, 92, 88, 80, 88, 90, 75, 78),
    ("Bob", 35, 78, 85, 80, 70, 82, 85, 65, 45)
]

columns = ["Name", "Age", "English_Marks", "English_Highest",
           "English_Avg", "English_Lowest", "History_Marks",
           "History_Highest", "History_Avg", "History_Lowest"]

df = spark.createDataFrame(data, columns)

Original DataFrame:
+-----+---+-------------+---------------+-----------+--------------+-------------+---------------+-----------+--------------+
| Name|Age|English_Marks|English_Highest|English_Avg|English_Lowest|History_Marks|History_Highest|History_Avg|History_Lowest|
+-----+---+-------------+---------------+-----------+--------------+-------------+---------------+-----------+--------------+
| John| 25|           80|             90|         85|            75|           85|             88|         70|            75|
|Alice| 30|           85|             92|         88|            80|           88|             90|         75|            78|
|  Bob| 35|           78|             85|         80|            70|           82|             85|         65|            45|
+-----+---+-------------+---------------+-----------+--------------+-------------+---------------+-----------+--------------+

The final output I want looks like this: the DataFrame should be melted on the subject names, which I can pass as a list as indicated in the column names, so that English and History move into rows while the values of their four metrics (Marks, Highest, Avg, Lowest) stay in columns.

subject = ['English', 'History'] # There can be more items in the list

Melted DataFrame:
+-----+---+-------+-----+-------+----+------+
| Name|Age|Subject|Marks|Highest| Avg|Lowest|
+-----+---+-------+-----+-------+----+------+
| John| 25|English| 80  |   90  | 85 |  75  |
| John| 25|History| 85  |   88  | 70 |  75  |
|Alice| 30|English| 85  |   92  | 88 |  80  |
|Alice| 30|History| 88  |   90  | 75 |  78  |
|  Bob| 35|English| 78  |   85  | 80 |  70  |
|  Bob| 35|History| 82  |   85  | 65 |  45  |
+-----+---+-------+-----+-------+----+------+

I would like to avoid converting to pandas to achieve this.

I am only able to apply a normal melt using stack inside

df.selectExpr
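For context, a "normal" melt with stack simply enumerates one (label, column) pair per metric column, producing one long row per metric. The snippet below is a minimal sketch that only assembles that expression string for the sample columns; it does not need a running Spark session:

```python
# Columns to melt in the plain (non-grouped) way.
metric_cols = [
    "English_Marks", "English_Highest", "English_Avg", "English_Lowest",
    "History_Marks", "History_Highest", "History_Avg", "History_Lowest",
]

# Each column contributes a (literal label, column reference) pair to stack().
pairs = ", ".join(f"'{c}', {c}" for c in metric_cols)
normal_melt_expr = f"stack({len(metric_cols)}, {pairs}) AS (Metric, Value)"
print(normal_melt_expr)
```

This would be passed as `df.selectExpr("Name", "Age", normal_melt_expr)` and yields one row per (Name, Metric) pair, which is exactly what the question wants to avoid in favor of the grouped layout below.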
pyspark apache-spark-sql
1 Answer

0 votes

The string passed to the

stack
function can be built with plain string concatenation:

expr_list = []
# Change your subject here
subject_list = ['English', 'History']
calculation_list = ["Marks", "Highest", "Avg", "Lowest"]

for subject in subject_list:
    expr_list.append(f"'{subject}'")
    for calculation in calculation_list:
        expr_list.append(f"{subject}_{calculation}")

column_str = ", ".join(expr_list)
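As a sanity check, the loop above produces the string below for the two sample subjects: quoted entries become literal `Subject` labels, bare names are column references, and the total expression count must divide evenly into the five output columns per row (this snippet just reruns the same loop standalone):

```python
subject_list = ['English', 'History']
calculation_list = ["Marks", "Highest", "Avg", "Lowest"]

expr_list = []
for subject in subject_list:
    # Quoted subject name -> literal value for the Subject column.
    expr_list.append(f"'{subject}'")
    for calculation in calculation_list:
        # Bare name -> reference to the subject's metric column.
        expr_list.append(f"{subject}_{calculation}")

column_str = ", ".join(expr_list)

# stack(n, ...) emits n rows; each row consumes len(expr_list) / n
# expressions, which must equal the 5 output columns (Subject + 4 metrics).
assert len(expr_list) % len(subject_list) == 0
assert len(expr_list) // len(subject_list) == 1 + len(calculation_list)
print(column_str)
```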

The result should look like this:

df.selectExpr(
    'Name', 'Age',
    # The row count must track the subject list, not a hardcoded 2,
    # since the list can contain more subjects.
    f"stack({len(subject_list)}, {column_str}) AS (Subject, Marks, Highest, Avg, Lowest)",
).show(
    10, False
)

+-----+---+-------+-----+-------+---+------+
|Name |Age|Subject|Marks|Highest|Avg|Lowest|
+-----+---+-------+-----+-------+---+------+
|John |25 |English|80   |90     |85 |75    |
|John |25 |History|85   |88     |70 |75    |
|Alice|30 |English|85   |92     |88 |80    |
|Alice|30 |History|88   |90     |75 |78    |
|Bob  |35 |English|78   |85     |80 |70    |
|Bob  |35 |History|82   |85     |65 |45    |
+-----+---+-------+-----+-------+---+------+
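The same idea generalizes: wrapping the string construction in a small helper keeps the stack row count tied to the subject list, so adding a subject requires no other change. The function name `build_stack_expr` is my own for illustration, not from the answer, and it assumes columns are named exactly `<subject>_<metric>`:

```python
def build_stack_expr(subjects, metrics):
    """Build a stack() expression melting <subject>_<metric> columns into rows.

    Assumes the DataFrame has a column named f"{subject}_{metric}"
    for every (subject, metric) combination.
    """
    parts = []
    for subject in subjects:
        parts.append(f"'{subject}'")  # literal label for the Subject column
        parts.extend(f"{subject}_{m}" for m in metrics)
    body = ", ".join(parts)
    out_cols = ", ".join(["Subject"] + list(metrics))
    return f"stack({len(subjects)}, {body}) AS ({out_cols})"

expr = build_stack_expr(["English", "History", "Math"],
                        ["Marks", "Highest", "Avg", "Lowest"])
# Usage sketch: df.selectExpr("Name", "Age", expr).show()
print(expr)
```

Note that on Spark 3.4+ `DataFrame.unpivot` offers a built-in melt, but it produces the long one-row-per-metric layout; for this grouped shape the `stack` expression above remains the direct route.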