How to get only the plain text from a web page

Question

I am trying to access the content of a web page using urllib and bs4:

import bs4
from urllib.request import Request, urlopen

url = "https://ar5iv.labs.arxiv.org/html/2309.10034"
req = Request(url=url, headers={'User-Agent': 'Mozilla/7.0'})
webpage = str(urlopen(req).read())
soup = bs4.BeautifulSoup(webpage)
text = soup.get_text()

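However, the result contains all kinds of non-ASCII characters and escape sequences, such as \n, \xc2, \x89, or \subscript. I want to remove all of these and extract only the plain text. Is this possible, and how can I do it?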

Answer 1

Here is one way to get only the text from that paper:

import requests
from bs4 import BeautifulSoup as bs

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
soup = bs(requests.get('https://ar5iv.labs.arxiv.org/html/2309.10034', headers=headers).text, 'html.parser')
text_only = ' '.join([x.get_text(strip=True, separator=' ') for x in soup.select('p[class="ltx_p"]')])

print(text_only)

Result in the terminal:

We present constraints on cosmological parameters using maps from the last Planck data release (PR4). In particular, we detail an upgraded version of the cosmic microwave background likelihood, HiLLiPoP , based on angular power spectra and relying on a physical modelling of the foreground residuals in the spectral domain. This new version of the likelihood retains a larger sky fraction (up to 75 %) and uses an extended multipole range. Using this likelihood, along with low- ℓ ℓ \ell measurements from LoLLiPoP , we derive constraints on Λ Λ \Lambda CDM parameters that are in good agreement with previous Planck 2018 results, but with 10 % to 20 % smaller uncertainties.
We demonstrate that [...]

If you like, you can clean up the text further using regex or Python's str.replace(). See the Requests and BeautifulSoup documentation for more details.
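As a rough sketch of that further cleanup, the snippet below covers two things. First, the literal \n and \xc2 sequences in the question most likely come from calling str() on the bytes returned by urlopen(...).read(): that produces the bytes repr (b'...') rather than decoded text, so decoding the bytes avoids them. Second, the duplicated math tokens in the output above (such as "ℓ ℓ \ell") can be reduced with a few regular expressions. The helper names decode_page and clean_text are hypothetical, UTF-8 is an assumed encoding, and the patterns are assumptions tuned to this sample output rather than a general solution.

```python
import re


def decode_page(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: turn downloaded bytes into text.

    str(raw) would return the repr of the bytes object, with escapes
    such as \\n and \\xc2 written out literally; decode() does not.
    """
    # errors="replace" keeps going if a byte is invalid in the assumed encoding.
    return raw.decode(encoding, errors="replace")


def clean_text(text: str) -> str:
    """Hypothetical helper: tidy text extracted with get_text()."""
    # Drop LaTeX-style commands such as \ell or \Lambda that come from
    # math annotations in the page.
    text = re.sub(r"\\[A-Za-z]+", " ", text)
    # Collapse a whole token that is immediately repeated ("ℓ ℓ" -> "ℓ").
    # Caveat: this also collapses legitimate repeats such as "had had".
    text = re.sub(r"(?<!\S)(\S+)(?:\s+\1)+(?!\S)", r"\1", text)
    # Replace every run of whitespace (newlines included) with one space.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "low- ℓ ℓ \\ell measurements on Λ Λ \\Lambda CDM\nparameters"
    print(clean_text(sample))
```

With the question's urllib code, this would mean replacing str(urlopen(req).read()) with decode_page(urlopen(req).read()) and then passing soup.get_text() through clean_text. The requests-based answer already works on decoded text via .text, so only clean_text would apply there.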
