Pickling error when using NLTK stopwords in PySpark (Databricks)


I found the following function online:

def RemoveStops(data_str):    
    #nltk.download('stopwords')
    english_stopwords = stopwords.words("english")
    broadcast(english_stopwords)
    # expects a string
    stops = set(english_stopwords)
    list_pos = 0
    cleaned_str = ''
    text = data_str.split()
    for word in text:
        if word not in stops:
            # rebuild cleaned_str
            if list_pos == 0:
                cleaned_str = word
            else:
                cleaned_str = cleaned_str + ' ' + word
            list_pos += 1
    return cleaned_str

Then I run the following:

ColumntoClean = udf(lambda x: RemoveStops(x), StringType())
data = data.withColumn("CleanedText", ColumntoClean(data[TextColumn]))

The error I get is:

PicklingError: args[0] from __newobj__ args has the wrong class

Interestingly, if I rerun the same block of code, it runs without raising any pickling error. Can someone help me figure this out? Thanks!

pyspark nltk stop-words
1 Answer
Just change your function like this and it should run:

import nltk
from nltk.corpus import stopwords

# download the corpus and build the list once, on the driver,
# outside the function that the UDF will serialize
nltk.download('stopwords')
english_stopwords = stopwords.words("english")

def RemoveStops(data_str):
    # expects a string
    stops = set(english_stopwords)
    list_pos = 0
    cleaned_str = ''
    text = data_str.split()
    for word in text:
        if word not in stops:
            # rebuild cleaned_str
            if list_pos == 0:
                cleaned_str = word
            else:
                cleaned_str = cleaned_str + ' ' + word
            list_pos += 1
    return cleaned_str
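
For completeness, a minimal sketch of registering the fixed function as a UDF and applying it; the sample DataFrame and the Text column name are illustrative, not from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# toy DataFrame standing in for the asker's data
data = spark.createDataFrame([("this is a test",)], ["Text"])

ColumntoClean = udf(RemoveStops, StringType())
data = data.withColumn("CleanedText", ColumntoClean(data["Text"]))
data.show(truncate=False)
# with the standard NLTK English stopword list, "this", "is" and "a"
# are dropped, leaving "test" in CleanedText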

Databricks is a pain with nltk: when the UDF is applied, stopwords.words("english") cannot run inside the function, so build the stopword list on the driver and let the function's closure capture it instead.
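
If you would rather ship the stopword list to the executors explicitly instead of relying on closure capture, a Spark broadcast variable is another option. A sketch, assuming an active spark session and that the corpus has already been downloaded on the driver (remove_stops_bc and stops_bc are illustrative names):

import nltk
from nltk.corpus import stopwords
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

nltk.download('stopwords')
stops_bc = spark.sparkContext.broadcast(set(stopwords.words("english")))

def remove_stops_bc(data_str):
    # the broadcast value is looked up locally on each executor
    stops = stops_bc.value
    return ' '.join(w for w in data_str.split() if w not in stops)

clean_udf = udf(remove_stops_bc, StringType())
data = data.withColumn("CleanedText", clean_udf(data["Text"]))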