Pyspark - Group and aggregate based on a key in an RDD

Problem description

I have the following RDD:

[[1,101,001,100,product1],
 [2,102,001,105,product2],
 [3,103,002,101,product3]]

The expected output is:

[('001', ['product1','100'],['product2','105']),('002',['product3','101'])]
Tags: pyspark, aggregate, rdd
1 Answer

Feeling the festive spirit, so here you go:

I'm assuming items 3 and 5 in the nested lists are supposed to be strings...

Create the RDD:

ls = [[1,101,"001",100,"product1"],
 [2,102,"001",105,"product2"],
 [3,103,"002",101,"product3"]]

rdd1 = sc.parallelize(ls)

This makes rdd1:

[[1, 101, '001', 100, 'product1'],
 [2, 102, '001', 105, 'product2'],
 [3, 103, '002', 101, 'product3']]
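
A side note: sc above is the SparkContext that the pyspark shell provides automatically; in a standalone script you have to create one yourself. A minimal sketch, assuming local mode and a made-up app name:

from pyspark import SparkContext

# only needed outside the pyspark shell; "local[*]" and the app name are assumptions
sc = SparkContext("local[*]", "group-by-key-example")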

Mapping:

# discard items 1 & 2; set item 3 as key
rdd2 = rdd1.map(lambda row: (row[2], [row[4], row[3]]))
rdd2.collect() 

> [('001', ['product1', 100]),
>  ('001', ['product2', 105]),
>  ('002', ['product3', 101])]
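
One detail: the expected output in the question shows the quantity as the string '100', while the map above keeps it as the int 100. If the string form is what you actually need, the same lambda with a str() cast would do it (a sketch):

# same mapping, but cast the numeric column to str to match '100' in the question
rdd2 = rdd1.map(lambda row: (row[2], [row[4], str(row[3])]))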

# group by key and map values to a list
rdd3 = rdd2.groupByKey().mapValues(list)
rdd3.collect()

> [('001', [['product1', 100], ['product2', 105]]), 
>  ('002', [['product3', 101]])]
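
groupByKey is the natural fit here, since all we do is collect values into lists. As an aside, the same result can be built with aggregateByKey, which is worth reaching for when the merge step actually shrinks the data (a sketch producing the same rdd3):

# zero value: a fresh empty list per key;
# seqFunc appends one value within a partition, combFunc merges partition lists
rdd3 = rdd2.aggregateByKey([], lambda acc, v: acc + [v], lambda a, b: a + b)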

This isn't quite the output you asked for yet, but the RDD is now keyed the way you wanted.
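
If you do want tuples of the exact shape shown in the question, with the grouped value lists sitting next to the key, one more map over rdd3 gets you there (a sketch, using Python 3 tuple unpacking):

# unpack each key's grouped lists into a flat tuple: ('001', [...], [...])
rdd4 = rdd3.map(lambda kv: (kv[0], *kv[1]))
rdd4.collect()

> [('001', ['product1', 100], ['product2', 105]),
>  ('002', ['product3', 101])]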
