How can I use a regular expression to extract data from two HTML elements with similar classes?

Question (votes: 1, answers: 2)

How can I use a Python regular expression to extract the upvote count (215) and the downvote count (82) from the following HTML snippet?

<span class="vote-actions">
    <a class="btn btn-default vote-action-good">
        <span class="icon thumb-up black black-hover">&nbsp;</span>
        <span class="rating-inbtn">215</span>
    </a>
    <a class="btn btn-default vote-action-bad">
        <span class="icon thumb-down grey black-hover">&nbsp;</span>
        <span class="rating-inbtn">82</span>
    </a>
</span>

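I formatted the HTML for readability; the original markup contains no '\n' or '\t' characters.

FYI, I don't want a Beautiful Soup solution. I'm looking for something based on Python's re.search function.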

python regex web-scraping
2 Answers
2 votes

To find these two numbers, I would do this:

text = '''<span class="vote-actions">
    <a class="btn btn-default vote-action-good">
        <span class="icon thumb-up black black-hover">&nbsp;</span>
        <span class="rating-inbtn">215</span>
    </a>
    <a class="btn btn-default vote-action-bad">
        <span class="icon thumb-down grey black-hover">&nbsp;</span>
        <span class="rating-inbtn">82</span>
    </a>
</span>'''

import re

# Raw string so that \d reaches the regex engine unchanged
a = re.findall(r'rating-inbtn">(\d+)', text)
print(a)

['215', '82']

In the HTML, the first number is the upvote count and the second is the downvote count, so I don't need anything more elaborate:

up = a[0]
down = a[1]
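
Since the question asks specifically for re.search, here is a minimal sketch that ties each number to its vote-action-good or vote-action-bad class instead of relying on position. The helper name vote_count is hypothetical, and the markup is assumed to be minified as the question describes. The lazy .*? stops at the first rating-inbtn span after the class name, so this assumes every vote anchor actually contains one.

```python
import re

# Minified copy of the snippet, since the question says the real page
# has no newlines or tabs.
html = (
    '<span class="vote-actions">'
    '<a class="btn btn-default vote-action-good">'
    '<span class="icon thumb-up black black-hover">&nbsp;</span>'
    '<span class="rating-inbtn">215</span></a>'
    '<a class="btn btn-default vote-action-bad">'
    '<span class="icon thumb-down grey black-hover">&nbsp;</span>'
    '<span class="rating-inbtn">82</span></a>'
    '</span>'
)


def vote_count(html, kind):
    """Hypothetical helper: return the count that follows
    class vote-action-<kind>, or None if nothing matches."""
    pattern = r'vote-action-' + re.escape(kind) + r'\b.*?rating-inbtn">(\d+)'
    # re.DOTALL lets .*? cross newlines if the markup is ever pretty-printed.
    match = re.search(pattern, html, re.DOTALL)
    return int(match.group(1)) if match else None


up = vote_count(html, "good")
down = vote_count(html, "bad")
print(up, down)  # 215 82
```

Unlike the positional a[0] / a[1] approach, this returns None when a class is missing rather than silently picking up the wrong number.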

If that is not enough, I would use an HTML parser:

text = '''<span class="vote-actions">
    <a class="btn btn-default vote-action-good">
        <span class="icon thumb-up black black-hover">&nbsp;</span>
        <span class="rating-inbtn">215</span>
    </a>
    <a class="btn btn-default vote-action-bad">
        <span class="icon thumb-down grey black-hover">&nbsp;</span>
        <span class="rating-inbtn">82</span>
    </a>
</span>'''

import lxml.html

soup = lxml.html.fromstring(text)

up = soup.xpath('//a[@class="btn btn-default vote-action-good"]/span[@class="rating-inbtn"]')
up = up[0].text
print(up)

down = soup.xpath('//a[@class="btn btn-default vote-action-bad"]/span[@class="rating-inbtn"]')
down = down[0].text
print(down)
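
If installing lxml is not an option, the same idea can be sketched with the standard library's html.parser module. This is only an illustration, not part of the original answer: the class name VoteParser and its attributes are hypothetical, and it assumes each vote anchor carries a vote-action-* class and wraps a rating-inbtn span, as in the snippet.

```python
from html.parser import HTMLParser

html = (
    '<span class="vote-actions">'
    '<a class="btn btn-default vote-action-good">'
    '<span class="icon thumb-up black black-hover">&nbsp;</span>'
    '<span class="rating-inbtn">215</span></a>'
    '<a class="btn btn-default vote-action-bad">'
    '<span class="icon thumb-down grey black-hover">&nbsp;</span>'
    '<span class="rating-inbtn">82</span></a>'
    '</span>'
)


class VoteParser(HTMLParser):
    """Hypothetical parser: collect the rating-inbtn number found
    inside each <a> whose class list has a vote-action-* entry."""

    def __init__(self):
        super().__init__()
        self.counts = {}         # maps "good" / "bad" to an int
        self._kind = None        # suffix of the current vote-action-* anchor
        self._in_rating = False  # True while inside a rating-inbtn span

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a":
            for cls in classes:
                if cls.startswith("vote-action-"):
                    self._kind = cls[len("vote-action-"):]
        elif tag == "span" and "rating-inbtn" in classes:
            self._in_rating = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._kind = None
        elif tag == "span":
            self._in_rating = False

    def handle_data(self, data):
        # Only record digits seen inside a rating span within a vote anchor.
        if self._in_rating and self._kind and data.strip().isdigit():
            self.counts[self._kind] = int(data)


parser = VoteParser()
parser.feed(html)
print(parser.counts)  # {'good': 215, 'bad': 82}
```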

2 votes

Don't use regular expressions to parse HTML: https://stackoverflow.com/a/1732454/412529

Here is how to do it with BeautifulSoup:

html = '''<span class="vote-actions">...'''
import bs4
soup = bs4.BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
soup.select("a.vote-action-good span.rating-inbtn")[0].text  # '215'
soup.select("a.vote-action-bad span.rating-inbtn")[0].text  # '82'