How to remove stop words from a PySpark RDD?


How can I remove stop words from a PySpark RDD?

I have the following RDD:

my_doc = sc.parallelize([("Alex Smith", 101, ["I", "saw", "a", "sheep"]), ("John Lee", 102, ["He", "likes", "ice", "cream"])])

That is, the data looks like:

(("Alex Smith", 101, ["I", "saw", "a", "sheep"]), ("John Lee", 102, ["He", "likes", "ice", "cream"]))

I want to remove stop words such as "a", "he", and "i" from x[2] (the word list) in each tuple.

After removing the stop words, the result should look like this:

(("Alex Smith", 101, ["saw", "sheep"]), ("John Lee", 102, ["likes", "ice", "cream"]))
python apache pyspark jupyter-notebook rdd
1 Answer

0
votes

Map over the RDD to create new tuples whose word lists have the stop words filtered out.

stop_words = ['i', 'a', 'he']
# word.lower() makes the lookup case-insensitive, so "I" and "He" are removed too
my_doc.map(
    lambda x: (x[0], x[1], list(filter(lambda word: word.lower() not in stop_words, x[2])))
).collect()

[('Alex Smith', 101, ['saw', 'sheep']),
 ('John Lee', 102, ['likes', 'ice', 'cream'])]
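As a side note, the per-row logic inside the `map` can be factored into a plain Python function, which makes it easy to unit-test without a SparkContext and keeps the lambda short. A minimal sketch (the helper name `remove_stops` is mine, and a `set` is used instead of a list for O(1) membership checks):

```python
# Hypothetical helper: the same per-row logic the answer's lambda performs,
# factored out so it can be tested locally without a Spark cluster.
stop_words = {"i", "a", "he"}  # set membership is O(1), vs O(n) for a list

def remove_stops(row, stops=stop_words):
    name, doc_id, words = row
    # lower-case each word before the lookup so "I" and "He" match "i" and "he"
    return (name, doc_id, [w for w in words if w.lower() not in stops])

rows = [("Alex Smith", 101, ["I", "saw", "a", "sheep"]),
        ("John Lee", 102, ["He", "likes", "ice", "cream"])]
print([remove_stops(r) for r in rows])
# → [('Alex Smith', 101, ['saw', 'sheep']), ('John Lee', 102, ['likes', 'ice', 'cream'])]
```

On the cluster you would then write `my_doc.map(remove_stops).collect()`, which produces the same result as the lambda version above.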