Given the following RDD:
[('Project', 10),
("Alice's", 11),
('in', 401),
('Wonderland,', 3),
('Lewis', 10),
('Carroll', 4),
('', 2238),
('is', 10),
('use', 24),
('of', 596),
('anyone', 4),
('anywhere', 3),
where the value in each pair of the RDD is the word frequency.
I only want to return the words that appear 10 times. Expected output:
[('Project', 10),
('Lewis', 10),
('is', 10)]
I tried using
rdd.filter(lambda words: (words,10)).collect()
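But it still shows the whole list. How should I handle this?
Your lambda function is wrong. The expression (words, 10) does not compare anything; it just builds a non-empty tuple, which is always truthy, so the filter keeps every element. It should be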
rdd.filter(lambda words: words[1] == 10).collect()
For example:
>>> my_rdd = sc.parallelize([('Project', 10), ("Alice's", 11), ('in', 401), ('Wonderland,', 3), ('Lewis', 10), ('is', 10)])
>>> my_rdd.filter(lambda w: w[1] == 10).collect()
[('Project', 10), ('Lewis', 10), ('is', 10)]
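
To see why the original predicate kept everything, here is a minimal sketch that runs without a Spark context. It assumes Python's built-in filter() as a stand-in for rdd.filter(); both keep the elements for which the predicate returns a truthy value. The names pairs, broken, and fixed are made up for this illustration.

```python
# Plain-Python illustration: built-in filter() stands in for rdd.filter() here.
pairs = [('Project', 10), ("Alice's", 11), ('in', 401),
         ('Wonderland,', 3), ('Lewis', 10), ('is', 10)]

# Original predicate: builds a new tuple (pair, 10), which is always truthy,
# so no element is ever dropped.
broken = list(filter(lambda words: (words, 10), pairs))

# Corrected predicate: compares the count (second element of the pair) with 10.
fixed = list(filter(lambda words: words[1] == 10, pairs))

print(broken == pairs)  # True: nothing was filtered out
print(fixed)            # [('Project', 10), ('Lewis', 10), ('is', 10)]
```

The same comparison, words[1] == 10, is what the corrected rdd.filter call above relies on.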