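I am trying to scrape data from span elements with BeautifulSoup and put it into a DataFrame. So far I have managed to reach the right part of the web page, but I can't seem to grab the keywords "Happiness" and "Sadness" and the numbers next to them.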
<span class="text-border tooltips" data-original-title="Happiness 84%
Sadness 80%
" data-placement="left" data-toggle="tooltip">More stats</span>,
<span class="text-border tooltips" data-original-title="Happiness 70%
Sadness 59%
" data-placement="left" data-toggle="tooltip">More stats</span>
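It would be really helpful if someone could help me extract all the numbers next to Happiness and Sadness and put them into a pandas DataFrame as columns.
Many thanks.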
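If every span is guaranteed to have a data-original-title attribute, and the title always appears in the format "Happiness<SPACE><PERCENTAGE><NEW LINE>Sadness<SPACE><PERCENTAGE>", then the following should work for you.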
>>> import itertools
>>> import re
>>> import pandas as pd
>>> import bs4
>>> html = """<span class="text-border tooltips" data-original-title="Happiness 84%
... Sadness 80%
... " data-placement="left" data-toggle="tooltip">More stats</span>,
... <span class="text-border tooltips" data-original-title="Happiness 70%
... Sadness 59%
... " data-placement="left" data-toggle="tooltip">More stats</span>"""
>>> soup = bs4.BeautifulSoup(html, 'lxml')
>>> all_rows = []
>>> for span in soup.find_all('span'):
...     title_eles = re.split(' |\n', span['data-original-title'])
...     title_eles = list(filter(None, title_eles))
...     row = dict(itertools.zip_longest(title_eles[::2], title_eles[1::2], fillvalue=""))
...     all_rows.append(row)
...
>>> pd.DataFrame(all_rows)
  Happiness Sadness
0       84%     80%
1       70%     59%
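If you would rather end up with numbers than strings such as "84%", a regular expression is one alternative to splitting on spaces and newlines. This is only a minimal sketch: `parse_tooltip` and `PAIR_RE` are hypothetical names, not part of any library, and the pattern assumes every entry looks like `<word> <integer>%`.

```python
import re

# Assumed format: one "<label> <integer>%" pair per line, e.g. "Happiness 84%".
PAIR_RE = re.compile(r"(\w+)\s+(\d+)%")


def parse_tooltip(title):
    """Return {label: percentage as int} for every pair found in the tooltip text."""
    return {label: int(value) for label, value in PAIR_RE.findall(title)}


print(parse_tooltip("Happiness 84%\nSadness 80%\n"))
# {'Happiness': 84, 'Sadness': 80}
```

Each dict returned this way could be appended to `all_rows` in place of `row`, which should give integer columns in the resulting DataFrame.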
soup.find_all(class_='data-original-title') returns an empty list because data-original-title is an HTML attribute, not a class.
You could do something like this:
from bs4 import BeautifulSoup
s = """
<span class="text-border tooltips" data-original-title="Happiness 84%
Sadness 80%
" data-placement="left" data-toggle="tooltip">More stats</span>,
<span class="text-border tooltips" data-original-title="Happiness 70%
Sadness 59%
" data-placement="left" data-toggle="tooltip">More stats</span>
"""
soup = BeautifulSoup(s, "lxml")
spans = soup.find_all("span")  # get all spans
for span in spans:
    data = span["data-original-title"].split("\n")  # get the attribute and split on newlines
    happiness = data[0][:-1].replace("Happiness ", "")  # drop the trailing % and the label
    sadness = data[1][:-1].replace("Sadness ", "")
    print("{} {}".format(happiness, sadness))
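Since the question asks for a DataFrame rather than printed text, here is a rough sketch that selects spans by the attribute itself and collects the values as integer columns. It assumes the tooltip format shown in the question and uses Python's built-in "html.parser" so that no extra parser is required. The third span, which has no tooltip, is made up purely to show the attribute filter.

```python
import bs4
import pandas as pd

html = """
<span class="text-border tooltips" data-original-title="Happiness 84%
Sadness 80%
" data-placement="left" data-toggle="tooltip">More stats</span>,
<span class="text-border tooltips" data-original-title="Happiness 70%
Sadness 59%
" data-placement="left" data-toggle="tooltip">More stats</span>
<span>No tooltip here (hypothetical, added for illustration)</span>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

rows = []
# attrs={"data-original-title": True} matches only spans that carry the attribute,
# whatever its value, so the last span above is skipped.
for span in soup.find_all("span", attrs={"data-original-title": True}):
    row = {}
    for line in span["data-original-title"].splitlines():
        if not line.strip():
            continue  # ignore blank lines
        label, value = line.split()  # assumes "<label> <number>%" on each line
        row[label] = int(value.rstrip("%"))
    rows.append(row)

df = pd.DataFrame(rows)
print(df)
```

Filtering on the attribute means spans without a tooltip never reach the indexing step, so `span["data-original-title"]` cannot raise a KeyError here.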