I need to scrape a web page and find the five most common names. The expected output should look like this:
[
('Anna Pavlovna', 7),
('the prince', 7),
('the Empress', 3),
('The prince', 3),
('Prince Vasili', 2),
]
My code does count the most common names, but the output looks like this:
[(<span class="green">Anna Pavlovna</span>, 7),
(<span class="green">the prince</span>, 7),
(<span class="green">the Empress</span>, 3),
(<span class="green">The prince</span>, 3),
(<span class="green">Prince Vasili</span>, 2)]
What do I need to change so that the output looks like the sample output?
import nltk
from urllib.request import urlopen
from bs4 import BeautifulSoup
html=urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
soup=BeautifulSoup(html,'html.parser')
nameList = soup.findAll("span", {"class":"green"}) # may use bsObj.find_all()
fdist1 = nltk.FreqDist(nameList)
fdist1.most_common(5)
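The page is returning a 502 Bad Gateway error for me, but I think I know what your problem is. findAll gives you bs4 Tag elements, not strings, so you need to convert each one to a string with something like obj.get_text(). See the documentation.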
items = soup.findAll("span", {"class": "green"})
texts = [item.get_text() for item in items]
# Now you have the texts of the span elements
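BTW, the comment in your code example is incorrect: bsObj is never defined, so bsObj.find_all() would fail. The equivalent call on your object is soup.find_all().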
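Since I could not reach the page, here is a minimal offline sketch of the whole pipeline. It assumes a made-up inline HTML snippet (SAMPLE_HTML is hypothetical, not the real page content) and uses collections.Counter in place of nltk.FreqDist. FreqDist is a Counter subclass, so most_common() should behave the same way there.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page, which was unreachable (502).
SAMPLE_HTML = """
<p><span class="green">Anna Pavlovna</span> spoke to
<span class="green">the prince</span>. <span class="green">Anna Pavlovna</span>
smiled at <span class="green">the prince</span>, and
<span class="green">Anna Pavlovna</span> left.
<span class="red">not a name</span></p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns Tag objects, not strings.
tags = soup.find_all("span", {"class": "green"})

# Convert each Tag to its inner text before counting.
names = [tag.get_text() for tag in tags]

# Count the plain strings; most_common(5) returns (name, count) pairs.
top = Counter(names).most_common(5)
print(top)  # [('Anna Pavlovna', 3), ('the prince', 2)]
```

Swapping Counter(names) for nltk.FreqDist(names) should give the same pairs, because the counting now happens on plain strings.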
Just change this line:
nameList = soup.findAll("span", {"class":"green"})
to this:
nameList = [tag.text for tag in soup.findAll("span", {"class":"green"})]
The findAll function returns a list of tags. To get the text inside a tag, use its text attribute.
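To see why the original output contained markup, it can help to compare a Tag with its text directly. This is only a small sketch on a one-line snippet I made up. The whitespace cleanup at the end is an optional extra that is only needed if the real page wraps names across lines, which I have not verified.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up snippet with stray whitespace inside the span.
soup = BeautifulSoup('<span class="green"> Prince\nVasili </span>', "html.parser")
tag = soup.find("span", {"class": "green"})

# A Tag renders as its full HTML, which is what showed up in the question's output.
as_html = str(tag)

# .text (same result as get_text()) returns only the inner string, whitespace included.
raw = tag.text

# Optional: collapse runs of whitespace so equal names count as equal.
clean = " ".join(raw.split())

print(type(tag).__name__, repr(raw), repr(clean))
```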