I want to melt my PySpark DataFrame in a custom way. My DataFrame looks like this:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("CustomMeltExample") \
    .getOrCreate()

# Sample DataFrame
data = [
    ("John", 25, 80, 90, 85, 75, 85, 88, 70, 75),
    ("Alice", 30, 85, 92, 88, 80, 88, 90, 75, 78),
    ("Bob", 35, 78, 85, 80, 70, 82, 85, 65, 45),
]
columns = ["Name", "Age", "English_Marks", "English_Highest",
           "English_Avg", "English_Lowest", "History_Marks",
           "History_Highest", "History_Avg", "History_Lowest"]
df = spark.createDataFrame(data, columns)
Original DataFrame:
+-----+---+-------------+---------------+-----------+--------------+-------------+---------------+-----------+--------------+
| Name|Age|English_Marks|English_Highest|English_Avg|English_Lowest|History_Marks|History_Highest|History_Avg|History_Lowest|
+-----+---+-------------+---------------+-----------+--------------+-------------+---------------+-----------+--------------+
| John| 25| 80| 90| 85| 75| 85| 88| 70| 75|
|Alice| 30| 85| 92| 88| 80| 88| 90| 75| 78|
| Bob| 35| 78| 85| 80| 70| 82| 85| 65| 45|
+-----+---+-------------+---------------+-----------+--------------+-------------+---------------+-----------+--------------+
The final output I want looks like this: the DataFrame should be melted on the subject names, which I can pass as a list as indicated in the column names, so that English and History move into rows while the values of their four metrics stay in columns.
subject = ['English', 'History'] # There can be more items in the list
Melted DataFrame:
+-----+---+-------+-----+-------+---+------+
| Name|Age|Subject|Marks|Highest|Avg|Lowest|
+-----+---+-------+-----+-------+---+------+
| John| 25|English|   80|     90| 85|    75|
| John| 25|History|   85|     88| 70|    75|
|Alice| 30|English|   85|     92| 88|    80|
|Alice| 30|History|   88|     90| 75|    78|
|  Bob| 35|English|   78|     85| 80|    70|
|  Bob| 35|History|   82|     85| 65|    45|
+-----+---+-------+-----+-------+---+------+
I'd like to avoid converting to pandas to achieve this.
You can apply a normal melt with stack inside df.selectExpr. Building the string for the stack function is just a matter of string concatenation:
expr_list = []
# Change your subjects here
subject_list = ['English', 'History']
calculation_list = ["Marks", "Highest", "Avg", "Lowest"]
for subject in subject_list:
    expr_list.append(f"'{subject}'")
    for calculation in calculation_list:
        expr_list.append(f"{subject}_{calculation}")
column_str = ", ".join(expr_list)
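As a standalone sanity check (no Spark session needed), you can print the joined string to see what stack() will receive: a quoted subject label followed by that subject's four column references, repeated per subject:

```python
# Rebuild the expression string without Spark to inspect it.
subject_list = ['English', 'History']
calculation_list = ["Marks", "Highest", "Avg", "Lowest"]

expr_list = []
for subject in subject_list:
    expr_list.append(f"'{subject}'")                  # quoted: becomes the Subject value
    for calculation in calculation_list:
        expr_list.append(f"{subject}_{calculation}")  # unquoted: a column reference

column_str = ", ".join(expr_list)
print(column_str)
# (one line, shown wrapped here)
# 'English', English_Marks, English_Highest, English_Avg, English_Lowest,
# 'History', History_Marks, History_Highest, History_Avg, History_Lowest
```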
The result is as follows:
df.selectExpr(
    'Name', 'Age',
    f"stack(2, {column_str}) AS (Subject, Marks, Highest, Avg, Lowest)",
).show(10, False)
+-----+---+-------+-----+-------+---+------+
|Name |Age|Subject|Marks|Highest|Avg|Lowest|
+-----+---+-------+-----+-------+---+------+
|John |25 |English|80 |90 |85 |75 |
|John |25 |History|85 |88 |70 |75 |
|Alice|30 |English|85 |92 |88 |80 |
|Alice|30 |History|88 |90 |75 |78 |
|Bob |35 |English|78 |85 |80 |70 |
|Bob |35 |History|82 |85 |65 |45 |
+-----+---+-------+-----+-------+---+------+
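If the subject list can grow, note that stack's first argument must equal the number of subjects, so the hardcoded 2 above would need updating. A hypothetical helper (the name build_stack_expr is mine, not from the question) derives it with len(subject_list):

```python
def build_stack_expr(subject_list, calculation_list):
    """Build a stack() expression melting <Subject>_<Calc> columns into rows."""
    expr_list = []
    for subject in subject_list:
        expr_list.append(f"'{subject}'")  # literal Subject label
        expr_list.extend(f"{subject}_{calc}" for calc in calculation_list)
    out_cols = ", ".join(["Subject"] + calculation_list)
    # stack's first argument is the number of output rows per input row,
    # i.e. the number of subjects -- nothing is hardcoded.
    return f"stack({len(subject_list)}, {', '.join(expr_list)}) AS ({out_cols})"

# usage sketch:
# df.selectExpr("Name", "Age",
#               build_stack_expr(['English', 'History'],
#                                ["Marks", "Highest", "Avg", "Lowest"]))
```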