How do I remove stop words from a PySpark RDD?
I have the following RDD:

my_doc = sc.parallelize([("Alex Smith", 101, ["I", "saw", "a", "sheep"]), ("John Lee", 102, ["He", "likes", "ice", "cream"])])

Its contents look like this:

(("Alex Smith", 101, ["I", "saw", "a", "sheep"]), ("John Lee", 102, ["He", "likes", "ice", "cream"]))
I want to remove the stop words from x[2], e.g. "a", "he", "i", and so on. After removing the stop words, the RDD should look like this:
(("Alex Smith", 101, ["saw", "sheep"]), ("John Lee", 102, ["likes", "ice", "cream"]))
Map over the RDD, building a new tuple for each record whose third element is the filtered word list. Lowercasing each word before the lookup means capitalized entries like "I" and "He" still match the stop word list:
stop_words = ['i', 'a', 'he']
my_doc.map(
    lambda x: (x[0], x[1], [word for word in x[2] if word.lower() not in stop_words])
).collect()
[('Alex Smith', 101, ['saw', 'sheep']),
('John Lee', 102, ['likes', 'ice', 'cream'])]
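
If the stop word list grows beyond a handful of entries, a common refinement is to broadcast it as a set, so each executor receives one read-only copy and membership tests are O(1) instead of a linear scan of a list shipped with every task. A minimal sketch, assuming the same sc and my_doc as above:

# Broadcast the stop words as a set for fast, shared lookups on executors.
stop_words_bc = sc.broadcast({'i', 'a', 'he'})

my_doc.map(
    lambda x: (x[0], x[1], [w for w in x[2] if w.lower() not in stop_words_bc.value])
).collect()

This produces the same output as the version above; the difference only matters for large stop word lists or many tasks.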