Is it possible to remove these strange symbols?

Question (votes: 0, answers: 2)
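import re

# `stopwords` and `emoji_pattern` are not shown in the question; they are assumed to be defined elsewhere.
def _clean(text):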
    text = text.lower()
    text = re.sub(r'RT|rt', '', text)
    text = re.sub(r'&amp;', '&', text)
    text = re.sub(r'[?!.;:,#@-]', '', text)
    text = re.sub(r"[$&+,:;=?#]|[0-9]|<ed>|<U\+[A-Z0-9]+>", "", text)
    text = re.sub("<+[A-Z0-9]+>", "", text)
    text = re.sub(r'https?|:\//\w.*', '', text)
    text = re.sub(r'\//?w*', '',text)
    text = re.sub(r'\ ã°â*', '' ,text)
    words = text.split()
    words = [w for w in words if w not in stopwords]
    text = " ".join(words)
    text = emoji_pattern.sub(r'', text)
    return text

So far I have been using the code above. I don't know how to clean up text like this:

[happy birthday, last friday night (tgif) ððððð last friday, night tgif ff…

python twitter nlp
2 Answers
1 vote
You can remove all non-ASCII characters with the following regular expression:

text = re.sub(r'[^\x00-\x7F]+', '', text)

See also this question: Replace non-ASCII characters with a single space
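As a rough illustration of both variants (deleting non-ASCII runs, or replacing each run with a single space as in the linked question), here is a minimal sketch. The helper names `strip_non_ascii` and `clean_ascii` are made up for this example and are not part of the asker's code.

```python
import re

# Matches one or more consecutive characters outside the ASCII range.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7F]+")

def strip_non_ascii(text, replacement=""):
    """Replace each run of non-ASCII characters with `replacement`."""
    return NON_ASCII_RUN.sub(replacement, text)

def clean_ascii(text):
    """Replace non-ASCII runs with a space, then collapse repeated whitespace."""
    return " ".join(strip_non_ascii(text, " ").split())

sample = "last friday night (tgif) ððððð last friday night tgif ff"
print(strip_non_ascii(sample))  # the symbols are deleted; a double space is left behind
print(clean_ascii(sample))      # the symbols are deleted and whitespace is normalized
```

Replacing with a space before collapsing whitespace avoids gluing two words together when the symbols sit directly between them. Note that this also drops legitimate accented letters and non-Latin text, so it only suits data that is expected to be plain English.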

0 votes
Is this what you need: text = re.sub(r'[^\w\(\)]', '', text)?
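One caveat with a whitelist pattern like this: in Python 3, `\w` is Unicode-aware for `str` patterns, so letters such as `ð` count as word characters and are kept. The sketch below is only one possible variant, not the answerer's exact code: it passes `flags=re.ASCII`, allows whitespace through so that words are not merged, and uses a hypothetical helper name `keep_ascii_word_chars`.

```python
import re

def keep_ascii_word_chars(text):
    """Keep ASCII letters, digits, underscore, whitespace and parentheses."""
    # With re.ASCII, \w matches only [a-zA-Z0-9_]; without the flag,
    # Unicode letters such as "ð" would also match \w and survive.
    cleaned = re.sub(r"[^\w\s()]", "", text, flags=re.ASCII)
    # Collapse the gaps left behind by the removed characters.
    return " ".join(cleaned.split())

print(keep_ascii_word_chars("last friday night (tgif) ððððð ff…"))
```

Unlike the plain non-ASCII filter, this version also strips ASCII punctuation such as commas and square brackets, which may or may not be what you want.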