I'm trying to access a webpage's content using urllib and bs4:
import bs4
from urllib.request import Request, urlopen
url = "https://ar5iv.labs.arxiv.org/html/2309.10034"
req = Request(url=url, headers={'User-Agent': 'Mozilla/7.0'})
webpage = str(urlopen(req).read())
soup = bs4.BeautifulSoup(webpage)
text = soup.get_text()
However, the result contains all kinds of non-ASCII characters and escape sequences, such as \n, \xc2, \x89, or \subscript, etc. I'd like to remove all of these and extract only the plain text. Is that possible, and how can I do it?
Here's one way to get only the text from that paper:
import requests
from bs4 import BeautifulSoup as bs
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
soup = bs(requests.get('https://ar5iv.labs.arxiv.org/html/2309.10034', headers=headers).text, 'html.parser')
text_only = ' '.join([x.get_text(strip=True, separator=' ') for x in soup.select('p[class="ltx_p"]')])
print(text_only)
Terminal output:
We present constraints on cosmological parameters using maps from the last Planck data release (PR4). In particular, we detail an upgraded version of the cosmic microwave background likelihood, HiLLiPoP , based on angular power spectra and relying on a physical modelling of the foreground residuals in the spectral domain. This new version of the likelihood retains a larger sky fraction (up to 75 %) and uses an extended multipole range. Using this likelihood, along with low- ℓ ℓ \ell measurements from LoLLiPoP , we derive constraints on Λ Λ \Lambda CDM parameters that are in good agreement with previous Planck 2018 results, but with 10 % to 20 % smaller uncertainties.
We demonstrate that [...]
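As an aside, the \xc2 and \x89 sequences in your original attempt come from calling str() on the raw bytes returned by urlopen(req).read(), which produces the bytes repr (with backslash escapes baked in) rather than decoded text. A minimal sketch of the difference, using a stand-in byte string instead of a live request:

```python
# Simulate the bytes returned by urlopen(req).read()
raw = 'café ℓ'.encode('utf-8')

# str() on bytes yields the repr: it starts with b' and
# embeds literal backslash escapes like \xc3 into the text
bad = str(raw)

# decode() yields proper text with no escape artifacts
good = raw.decode('utf-8')
```

Decoding first (or letting requests handle it via .text, as above) avoids the escape artifacts entirely.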
If you like, you can clean the text up further with regex or Python's replace(). The requests documentation is here, and the BeautifulSoup documentation can be found here.
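Here's a minimal sketch of that regex cleanup, assuming you want to drop LaTeX command residue (like \Lambda or \ell) and any remaining non-ASCII characters; the sample string is illustrative, not the full scraped output:

```python
import re

# Illustrative fragment of the scraped text
text = "constraints on Λ Λ \\Lambda CDM parameters from low- ℓ ℓ \\ell measurements"

# Remove LaTeX command residue such as \Lambda or \ell
cleaned = re.sub(r'\\[A-Za-z]+', '', text)

# Strip any remaining non-ASCII characters
cleaned = cleaned.encode('ascii', errors='ignore').decode()

# Collapse the leftover runs of whitespace
cleaned = re.sub(r'\s+', ' ', cleaned).strip()

print(cleaned)  # constraints on CDM parameters from low- measurements
```

Note that this is lossy by design: the math symbols are discarded rather than converted, which matches the goal of keeping plain ASCII text only.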