I am parsing this RSS feed:
https://www.google.com/alerts/feeds/12700550304290381537/6239785894655863043
I am using the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/alerts/feeds/12700550304290381537/6239785894655863043"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, features='xml')
items = soup.findAll('entry')
news_items = []
for item in items:
    news_item = {}
    news_item['title'] = item.title.text
    news_item['link'] = item.link['href']
    news_item['published'] = item.published.text
    news_item['source'] = item.link
    news_items.append(news_item)
news_items[0]
This gives me the following output:
{'link': <link href="https://www.google.com/url?rct=j&sa=t&url=https://duitslandinstituut.nl/artikel/38250/duitsland-lanceert-corona-tracing-app&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNHDFPconO3h8mpzJh92x4HrjPL2tQ"/>,
'published': '2020-06-11T15:33:11Z',
'source': <link href="https://www.google.com/url?rct=j&sa=t&url=https://duitslandinstituut.nl/artikel/38250/duitsland-lanceert-corona-tracing-app&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNHDFPconO3h8mpzJh92x4HrjPL2tQ"/>,
'title': 'Duitsland lanceert <b>corona</b>-tracing-<b>app</b>'}
However, the output I want is:
{'link': 'https://duitslandinstituut.nl/artikel/38250/duitsland-lanceert-corona-tracing-app&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNHDFPconO3h8mpzJh92x4HrjPL2tQ',
'published': '2020-06-11T15:33:11Z',
'source': 'Duitslandinstituut',
'title': 'Duitsland lanceert corona-tracing-app'}
So, first, I want to get rid of the Google part of the link. Second, I want the source to be the name that follows the second 'https://', capitalized. Third, I want to strip any markup such as <b> from the title. I intend to put the results into a bibliography, so the text cannot contain any computer code.
I tried to solve this in BS4 but could not. Someone suggested doing it afterwards with regex in a pandas DataFrame, but I am not familiar with regex and found the examples hard to follow. Does anyone have a solution?
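(For reference, not part of the original question: the Google Alerts href wraps the real article URL in a url= query parameter, so the standard library's urllib.parse can extract it without regex. A minimal sketch, using the href from the output above:)

```python
from urllib.parse import urlparse, parse_qs

# href copied from the first feed entry above
google_link = ('https://www.google.com/url?rct=j&sa=t'
               '&url=https://duitslandinstituut.nl/artikel/38250/duitsland-lanceert-corona-tracing-app'
               '&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw'
               '&usg=AFQjCNHDFPconO3h8mpzJh92x4HrjPL2tQ')

# parse_qs turns the query string into a dict of lists;
# the 'url' parameter holds the real article address
query = parse_qs(urlparse(google_link).query)
real_url = query['url'][0]
print(real_url)  # https://duitslandinstituut.nl/artikel/38250/duitsland-lanceert-corona-tracing-app
```

Note this drops the trailing &ct=…&usg=… tracking parameters, since they belong to Google's redirect, not to the article.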
Try changing your for loop to something like this:
for item in items:
    news_item = {}
    news_item['link'] = item.link['href']
    news_item['published'] = item.published.text
    source = item.link['href'].split('//')[2].split('.')[1].capitalize()
    news_item['source'] = source
    # re-parse the title string so the <b> tags are dropped, keeping only text nodes
    n_s = BeautifulSoup(item.title.text, 'lxml')
    new_t = ''.join(n_s.find_all(text=True))
    news_item['title'] = new_t
    news_items.append(news_item)

for item in news_items:
    print(item)
Output (at the time I ran it):
{'link': 'https://www.google.com/url?rct=j&sa=t&url=https://www.nrc.nl/nieuws/2020/06/12/de-nieuwe-corona-app-een-balanceeract-tussen-te-streng-en-te-soft-a4002678&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNFc54u6UszfKuIsSWFHQ_JTeqfIQA', 'published': '2020-06-12T14:37:30Z', 'source': 'Nrc', 'title': "De nieuwe corona-app: een balanceeract tussen 'te streng' en 'te soft'"}
{'link': 'https://www.google.com/url?rct=j&sa=t&url=https://www.standaard.be/cnt/dmf20200612_04989287&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNHtIbdXB6q3hcvnNTvG7KC76fV7xQ', 'published': '2020-06-12T11:46:32Z', 'source': 'Standaard', 'title': 'Mobiele coronateams en app tegen tweede golf'}
etc.
You can use the .replace method if you don't want to use regex, and urllib.parse.urlparse to get the domain name from a URL:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def parse(url):
    news_items = []
    resp = requests.get(url)
    # use the XML parser so the Atom feed's <entry> elements are found reliably
    soup = BeautifulSoup(resp.text, features='xml')
    items = soup.find_all('entry')
    for item in items:
        # drop the <b>…</b> highlighting Google adds around search terms
        title = item.title.text.replace('<b>', '').replace('</b>', '')
        # strip the Google redirect prefix and the trailing tracking parameters
        link = item.link['href'].replace(
            'https://www.google.com/url?rct=j&sa=t&url=', '').split('&')[0]
        # domain part after 'www.', capitalized, e.g. 'www.nrc.nl' -> 'Nrc'
        source = urlparse(link).netloc.split('.')[1].title()
        published = item.published.text
        news_items.append(dict(zip(
            ['link', 'published', 'source', 'title'],
            [link, published, source, title]
        )))
    return news_items
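To see the cleaning steps in isolation, without fetching the feed: the sample href below is taken from the first answer's output, and the <b> tags in the title are assumed for illustration.

```python
from urllib.parse import urlparse

# sample href from the feed output above; the raw title's <b> tags are assumed
href = ('https://www.google.com/url?rct=j&sa=t'
        '&url=https://www.standaard.be/cnt/dmf20200612_04989287'
        '&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw'
        '&usg=AFQjCNHtIbdXB6q3hcvnNTvG7KC76fV7xQ')
raw_title = 'Mobiele coronateams en <b>app</b> tegen tweede golf'

# 1. strip the Google redirect prefix and the trailing tracking parameters
link = href.replace('https://www.google.com/url?rct=j&sa=t&url=', '').split('&')[0]
# 2. take the second dot-separated piece of the domain, capitalized
source = urlparse(link).netloc.split('.')[1].title()
# 3. remove the <b> markup from the title
title = raw_title.replace('<b>', '').replace('</b>', '')

print(link)    # https://www.standaard.be/cnt/dmf20200612_04989287
print(source)  # Standaard
print(title)   # Mobiele coronateams en app tegen tweede golf
```

One caveat: netloc.split('.')[1] assumes a leading 'www.'; for a domain like duitslandinstituut.nl it would return 'Nl' rather than the site name.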