Type error during normalization in sentiment analysis

Question
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import unicodedata
sid = SentimentIntensityAnalyzer()

for date, row in df_stocks.T.iteritems():
    print(df_stocks.loc[date, 'articles'])
    try:
        sentence = unicodedata.normalize('NFKD', df_stocks.loc[date, 'articles'])
        ss = sid.polarity_scores(sentence)
        df.at(date, 'compound', ss['compound'])
        df.at(date, 'neg', ss['neg'])
        df.at(date, 'neu', ss['neu'])
        df.at(date, 'pos', ss['pos'])
    except TypeError:
        print (df_stocks.loc[date, 'articles'])
        print ("date")

I have printed a small sample of df_stocks.loc[date, 'articles']:

Trump Officially Wins Michigan Amid Calls for a Recount. World Trade Organization Rules Against Boeing Tax Break for New Jet. Donald Trump Faces Obstacles to Resuming Waterboarding. Flamingo Mating Rules: 1. Learn the Funky Chicken. Facebook Runs Up Against German Hate Speech Laws. What Changed, and Didn’t, After the 1988 Slaying of a Rain Forest Hero in Brazil. James C. Woolery Leaves Hudson Executive Capital Hedge Fund. Hampshire College Draws Protests Over Removal of U.S. Flag. China Takes a Chain Saw to a Center of Tibetan Buddhism. 5 Ways to Be a Better Tourist. How Tour Guides Abroad Learn to Cater to Exotic Americans. Local Transmission of Zika Virus Is Reported in Texas. Why Gunshot Victims Have Reason to Like the Affordable Care Act. Donald Trump’s Threat to Close Door Reopens Old Wounds in Cuba. Jimmy Carter: America Must Recognize Palestine. ‘Trump Effect’ Is Already Shaping Events Around the World. Delta Air Lines Bans Disruptive Donald Trump Supporter for Life. A Failed Bid for Time Inc. May Be Only a Start. C. Megan Urry, Peering Into Universe, Spots Bias on the Ground. A Forgotten Step in Saving African Wildlife: Protecting the Rangers. President Jacob Zuma of South Africa Faces Leadership Challenge. Belgium and the Netherlands Swap Land, and Remain Friends. Congress May Hold Key to Handling Trump’s Conflicts of Interest. Under Trump, Will NASA’s Space Science Include Planet Earth?. Summer Project Turns Into Leukemia Testing Breakthrough. Supreme Court Agenda in the Trump Era? A Justice Seems to Supply One. California Official Says Trump’s Claim of Voter Fraud Is ‘Absurd’. Daily Report: Uber Wants to Avoid ‘Transportation’ Label in Europe. Ukraine Has Made Great Progress, but We Need Our Allies. Thousands Flee Aleppo, Syria, as Government Forces Advance. Stuck at the Bottom in China. Suspect Is Killed in Attack at Ohio State University That Injured 11. A Baby Court Offers Hope for Families. 

The stack trace is as follows:

AttributeError                            Traceback (most recent call last)
<ipython-input-17-b6e18273a873> in <module>()
      6      # if type(df_stocks.loc[date, 'articles']).__name__ == 'str':
      7         sentence = unicodedata.normalize('NFKD', df_stocks.loc[date, 'articles']).encode('ascii','ignore')
----> 8         ss = sid.polarity_scores(sentence)
      9         df.at(date, 'compound', ss['compound'])
     10         df.at(date, 'neg', ss['neg'])

1 frames
/usr/local/lib/python3.6/dist-packages/nltk/sentiment/vader.py in __init__(self, text)
    152     def __init__(self, text):
    153         if not isinstance(text, str):
--> 154             text = str(text.encode('utf-8'))
    155         self.text = text
    156         self.words_and_emoticons = self._words_and_emoticons()

AttributeError: 'bytes' object has no attribute 'encode'

As far as I understand, the problem seems to lie in the unicodedata.normalize function, but I don't understand what exactly is going wrong.

python normalization sentiment-analysis python-unicode
1 Answer

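Disclaimer: I have not used nltk and do not know the package; this answer is derived from the stack trace and the nltk source code.

This looks like a bug in the nltk package itself.

The offending code: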

    if not isinstance(text, str):
        text = str(text.encode('utf-8'))

is clearly Python 2 code and does not work in Python 3.
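To see the failure in isolation, here is a minimal sketch that copies the quoted check into a stand-alone function. The name coerce_text is hypothetical and is not part of nltk; only the two-line body mirrors the snippet above.

```python
def coerce_text(text):
    # Same check as the quoted snippet (hypothetical wrapper, not nltk API).
    if not isinstance(text, str):
        text = str(text.encode('utf-8'))
    return text


# In Python 3 a str skips the branch and is returned unchanged.
print(coerce_text("Trump Officially Wins Michigan"))

# bytes is not a str, so the branch runs, and bytes has no .encode method.
try:
    coerce_text(b"Trump Officially Wins Michigan")
except AttributeError as exc:
    print(exc)
```

The second call fails with the same kind of AttributeError as in the question's traceback.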

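For background: in Python 2, str is basically a byte string that can contain an encoded unicode string, and encode is what you call to encode a unicode string. In Python 3, str is the unicode string and bytes is the byte string, so you call encode on a str to convert it to bytes, whereas this code tries the opposite.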
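The direction of the two conversions in Python 3 can be checked with a short sketch that uses only built-in types (the sample string is arbitrary):

```python
text = "Didn’t"                    # str: a sequence of Unicode code points
data = text.encode("utf-8")        # str -> bytes
restored = data.decode("utf-8")    # bytes -> str

assert isinstance(data, bytes)
assert restored == text

# str offers encode but not decode; bytes offers decode but not encode.
print(hasattr(text, "encode"), hasattr(text, "decode"))
print(hasattr(data, "encode"), hasattr(data, "decode"))
```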

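I checked, and this code still exists in master (it has not changed since it was written in 2015), so you could file an issue on their GitHub site. I did not find this issue reported there, although there is a similar bug with the opposite error ('str' has no attribute 'decode', caused by the same kind of Python 2 code) that is still unresolved.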

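To work around this in your own code, you should pass a str to the function, since it cannot handle bytes. Note that in the traceback the sentence is produced with .encode('ascii','ignore'), which returns bytes in Python 3. You could try something like this: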

    sentence = unicodedata.normalize('NFKD', df_stocks.loc[date, 'articles'])
    if isinstance(sentence, bytes):
        sentence = sentence.decode('utf-8')  # this is assuming it is actually encoded as utf-8
    ss = sid.polarity_scores(sentence)

I don't know whether it handles str correctly, but I would expect so; otherwise this would have come up before.
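If the goal of the .encode('ascii','ignore') call in the traceback was to strip non-ASCII characters, one possible variation is to decode the result back to str before scoring. This is only a sketch under that assumption; the helper name to_ascii_str is made up for illustration, and byte input is assumed to be UTF-8.

```python
import unicodedata


def to_ascii_str(value):
    """Return value as an ASCII-only str (hypothetical helper)."""
    if isinstance(value, bytes):
        value = value.decode("utf-8")  # assumption: bytes input is UTF-8
    normalized = unicodedata.normalize("NFKD", value)
    # encode() yields bytes; decode() turns them back into a str.
    return normalized.encode("ascii", "ignore").decode("ascii")


sentence = to_ascii_str("What Changed, and Didn’t, After the 1988 Slaying")
print(type(sentence).__name__, sentence)
```

The result could then be passed to sid.polarity_scores(sentence) as in the snippet above. Characters with no ASCII decomposition, such as the curly apostrophe in the sample, are dropped rather than replaced.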
