How to get only the plain text from a web page

Question

I am trying to use urllib and bs4 to access the content of a web page:

import bs4
from urllib.request import Request, urlopen

url = "https://ar5iv.labs.arxiv.org/html/2309.10034"
req = Request(url=url, headers={'User-Agent': 'Mozilla/7.0'})
webpage = str(urlopen(req).read())
soup = bs4.BeautifulSoup(webpage)
text = soup.get_text()

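However, the result contains various escape sequences and non-ASCII characters such as \n, \xc2, \x89, \subscript, etc. I want to remove all of these and extract only the plain text. Is this possible? How can I do it?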

python beautifulsoup urllib
1 Answer

Here is one way to get only the text from that paper:

import requests
from bs4 import BeautifulSoup as bs

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
soup = bs(requests.get('https://ar5iv.labs.arxiv.org/html/2309.10034', headers=headers).text, 'html.parser')
text_only = ' '.join([x.get_text(strip=True, separator=' ') for x in soup.select('p[class="ltx_p"]')])

print(text_only)

Terminal output:

We present constraints on cosmological parameters using maps from the last Planck data release (PR4). In particular, we detail an upgraded version of the cosmic microwave background likelihood, HiLLiPoP , based on angular power spectra and relying on a physical modelling of the foreground residuals in the spectral domain. This new version of the likelihood retains a larger sky fraction (up to 75 %) and uses an extended multipole range. Using this likelihood, along with low- ℓ ℓ \ell measurements from LoLLiPoP , we derive constraints on Λ Λ \Lambda CDM parameters that are in good agreement with previous Planck 2018 results, but with 10 % to 20 % smaller uncertainties.
We demonstrate that [...]
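As for the \n and \xc2 sequences in the question: they most likely come from calling str() on the bytes object returned by urlopen(req).read(). That call produces the printable repr of the bytes (b'...') with every escape spelled out, not the decoded text. A minimal sketch of the difference follows; the hard-coded sample bytes stand in for a real response, and UTF-8 is an assumption about the page's encoding.

```python
# Hypothetical sample standing in for urlopen(req).read();
# a real page may use an encoding other than UTF-8.
raw = "low-ℓ measurements\n".encode("utf-8")

# str() on bytes gives the repr: escapes such as \xe2 and \n become literal characters.
as_repr = str(raw)

# decode() turns the bytes into actual text.
as_text = raw.decode("utf-8")

print(as_repr)
print(as_text)
```

If this is the cause, decoding the bytes first (or passing them to BeautifulSoup undecoded, which then tries to detect the encoding itself) should avoid the escape sequences.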

If you like, you can clean the text further with a regex or Python's replace(). See the Requests and BeautifulSoup documentation for more details.
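The further cleanup could look something like the sketch below. It is only an illustration: clean_text is a hypothetical helper, and the patterns (strip LaTeX-style commands such as \ell, collapse whitespace, remove the space before punctuation) are assumptions about what counts as noise in this particular output.

```python
import re

def clean_text(text):
    """Hypothetical helper: light regex cleanup of extracted paragraph text."""
    # Drop LaTeX-style commands such as \ell or \Lambda left over from math annotations.
    text = re.sub(r"\\[A-Za-z]+", "", text)
    # Collapse runs of whitespace (including newlines) into a single space.
    text = re.sub(r"\s+", " ", text)
    # Remove the space that the ' ' separator leaves before common punctuation.
    text = re.sub(r"\s+([,.;:%)])", r"\1", text)
    return text.strip()

sample = "measurements from LoLLiPoP , we derive constraints on \\Lambda CDM parameters"
print(clean_text(sample))
```

Duplicated math symbols (for example "ℓ ℓ") are left alone here, since collapsing repeated tokens blindly could also remove legitimate repetitions.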
