How do I remove stop words from a PySpark RDD?
I have the following RDD:

my_doc = sc.parallelize([("Alex Smith", 101, ["I", "saw", "a", "sheep"]), ("John Lee", 102, ["He", "likes", "ice", "cream"])])

Its contents look like this:

(("Alex Smith", 101, ["I", "saw", "a", "sheep"]), ("John Lee", 102, ["He", "likes", "ice", "cream"]))
I want to remove the stop words from x[2], e.g. "a", "he", "i", and so on. After removing the stop words, the RDD should look like this:
(("Alex Smith", 101, ["saw", "sheep"]), ("John Lee", 102, ["likes", "ice", "cream"]))
Map over the RDD, building a new tuple for each record whose third element is the filtered word list. Lowercasing each word before the lookup means capitalized entries like "I" and "He" still match the stop word list:
stop_words = ['i', 'a', 'he']
my_doc.map(
    lambda x: (x[0], x[1], [word for word in x[2] if word.lower() not in stop_words])
).collect()
[('Alex Smith', 101, ['saw', 'sheep']),
('John Lee', 102, ['likes', 'ice', 'cream'])]
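
If the stop word list grows beyond a handful of entries, a common refinement is to broadcast it as a set, so each executor receives one read-only copy and membership tests are O(1) instead of a linear scan of a list shipped with every task. A minimal sketch, assuming the same sc and my_doc as above:

# Broadcast the stop words as a set for fast, shared lookups on executors.
stop_words_bc = sc.broadcast({'i', 'a', 'he'})

my_doc.map(
    lambda x: (x[0], x[1], [w for w in x[2] if w.lower() not in stop_words_bc.value])
).collect()

This produces the same output as the version above; the difference only matters for large stop word lists or many tasks.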