How to get text in BeautifulSoup like JS .innerText rather than .textContent

python html python-3.x web-scraping beautifulsoup
1 Answer

If I understand you correctly, you can use a regular expression to collapse the extra whitespace in the text. Consider this example:

from bs4 import BeautifulSoup

html_text = """\
<body>
    <p>Lorem ipsum dolor sit amet,
        consectetur adipiscing elit.
        Maecenas sed mi lacus.
            <span>This is inner span.</span>
        Vivamus luctus vehicula lacus,
        ut malesuada justo posuere et.
        Donec ut diam volutpat</p>
</body>"""

soup = BeautifulSoup(html_text, "html.parser")
print(soup.p.text)

Prints:

Lorem ipsum dolor sit amet,
        consectetur adipiscing elit.
        Maecenas sed mi lacus.
            This is inner span.
        Vivamus luctus vehicula lacus,
        ut malesuada justo posuere et.
        Donec ut diam volutpat

To replace each run of two or more whitespace characters with a single space, you can do this:

import re

print(re.sub(r"\s{2,}", " ", soup.p.text))

This prints:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas sed mi lacus. This is inner span. Vivamus luctus vehicula lacus, ut malesuada justo posuere et. Donec ut diam volutpat
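
One caveat with the pattern above: `\s{2,}` only matches runs of two or more whitespace characters, so a lone newline between two words is left in place, and leading or trailing whitespace is not trimmed. Below is a minimal sketch of a slightly more defensive variant. The helper name `collapse_whitespace` and the sample string `raw_text` are made up for illustration (they stand in for whatever `soup.p.text` returns), and this is only an approximation of `.innerText`, which in a browser also depends on CSS rendering that BeautifulSoup does not model.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical stand-in for the string returned by soup.p.text
raw_text = (
    "Lorem ipsum dolor sit amet,\n"
    "        consectetur adipiscing elit.\n"
    "            This is inner span."
)

result = collapse_whitespace(raw_text)
print(result)
```

With the soup object from the example above, the call would be `collapse_whitespace(soup.p.text)`.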