I have a dataframe with text and a category. I want to count the words that are common across these categories. I am using nltk to remove stop words and to tokenize, but I am unable to carry the category through the process. Below is sample code for my problem.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, Row
import nltk
# requires the nltk 'punkt' and 'stopwords' data:
# nltk.download('punkt'); nltk.download('stopwords')

spark_conf = SparkConf().setAppName("test")
sc = SparkContext.getOrCreate(spark_conf)
spark = SparkSession(sc)
def wordTokenize(x):
    # flatten a list of sentences into a list of words
    words = [word for line in x for word in line.split()]
    return words

def rmstop(x):
    # filter out English stop words
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    words = [w for w in x if w not in stop_words]
    return words
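# quick sanity check of the helpers (hypothetical values, assuming the
# nltk 'stopwords' corpus has been downloaded):
#   wordTokenize(['i am so happy today']) -> ['i', 'am', 'so', 'happy', 'today']
#   rmstop(['i', 'am', 'so', 'happy', 'today']) -> ['happy', 'today']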
# in the actual problem I read a file as a dataframe,
# so creating a dataframe first here for the example
df = [('Happy', 'I am so happy today'),
      ('Happy', 'its my birthday'),
      ('Happy', 'lets have fun'),
      ('Sad', 'I am going to die today'),
      ('Neutral', 'I am going to office today'),
      ('Neutral', 'This is my house')]
rdd = sc.parallelize(df)
rdd_data = rdd.map(lambda x: Row(Category=x[0], text=x[1]))
df_data = spark.createDataFrame(rdd_data)
# convert to rdd for the nltk processing
# (note: selecting only 'text' is where the Category column gets dropped)
df_data_rdd = df_data.select('text').rdd.flatMap(lambda x: x)
# lowercase and sentence-tokenize
df_data_rdd1 = df_data_rdd.map(lambda x: x.lower())\
    .map(lambda x: nltk.sent_tokenize(x))
# word tokenize
data_rdd1_words = df_data_rdd1.map(wordTokenize)
# remove stop words and deduplicate
data_rdd1_words_clean = data_rdd1_words.map(rmstop)\
    .flatMap(lambda x: x)\
    .distinct()
data_rdd1_words_clean.collect()
Output: ['today', 'birthday', 'let', 'die', 'house', 'happy', 'fun', 'going', 'office']
I want to count the frequency of each word (after preprocessing) across categories. For example, "today": 3, because it appears in all three categories.
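A minimal sketch of one way to do this (the name category_word_counts is mine, not from the original code): keep the Category attached to each word, deduplicate the (word, category) pairs, and then count how many distinct categories each word appears in. It reuses wordTokenize and rmstop from above and assumes the nltk punkt data is available.

# sketch: count the number of distinct categories each word occurs in
category_word_counts = (df_data.rdd
    .map(lambda row: (row.Category, nltk.sent_tokenize(row.text.lower())))
    .map(lambda cw: (cw[0], rmstop(wordTokenize(cw[1]))))
    .flatMap(lambda cw: [(word, cw[0]) for word in cw[1]])
    .distinct()                          # keep one (word, category) pair per category
    .map(lambda wc: (wc[0], 1))
    .reduceByKey(lambda a, b: a + b))    # word -> number of categories it appears in

category_word_counts.collect()
# e.g. ('today', 3), since 'today' occurs in Happy, Sad and Neutral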
Here, extractphraseRDD is an RDD containing your phrases. So the code below will count the words and display them in descending order of frequency.
freqDistRDD = (extractphraseRDD
    .flatMap(lambda x: nltk.FreqDist(x).most_common())  # (word, count) pairs per phrase list
    .reduceByKey(lambda x, y: x + y)                    # sum counts for each word
    .sortBy(lambda x: x[1], ascending=False))           # most frequent first
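If extractphraseRDD instead carried (category, words) pairs, the same pattern could be keyed by (category, word) to get per-category counts. A hypothetical variant (the pair layout is my assumption, not part of the original answer):

# hypothetical: assumes each element is a (category, [words]) pair
freq_by_category = (extractphraseRDD
    .flatMap(lambda cw: [((cw[0], word), n)
                         for word, n in nltk.FreqDist(cw[1]).most_common()])
    .reduceByKey(lambda a, b: a + b)           # sum counts within each category
    .sortBy(lambda x: x[1], ascending=False))  # most frequent first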