Web scraping with beautifulsoup

Question (votes: -1, answers: 1)

I need some help with web scraping using the beautifulsoup library.

I need to extract the text from the web page http://thehill.com/ ... / 365407-sean-diddy-combs-want-to-buy-c ...

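My goal is to extract the text exactly as it appears on the web page. I am extracting all the "p" tags and their text, but inside the "p" tags there are "a" tags that also contain some text.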

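So my questions are:

  1. How can I convert Unicode strings (u"") into ordinary strings that match the text on the web page? When I extract only the "p" tags, the beautifulsoup library returns the text as Unicode, and even the special characters come out as Unicode escapes, so I want to convert the extracted Unicode text to plain text. How can I do that?

  2. How can I extract the text of a "p" tag that contains "a" tags? I mean I want the complete text inside the "p" tag, including the text inside nested tags.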

I have tried the following code:

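import requests
from bs4 import BeautifulSoup

html = requests.get("http://thehill.com/…/365407-sean-diddy-combs-wants-to-buy-c…").content
news_soup = BeautifulSoup(html, "html.parser")
a_text = news_soup.find_all('p')

# This line fails: find_all returns a list-like ResultSet, which has no .string attribute
y = a_text[1].find_all('a').string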
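
As a side note on the attempt above, here is a minimal sketch of how the text of a "p" tag, including its nested "a" tags, can be read with get_text(). The HTML snippet and the variable names are made up for illustration and stand in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for the real page
html = '<p>Published by <a href="#">Sports Illustrated</a> on Sunday.</p><p>No links here.</p>'
soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")

# get_text() concatenates every text node inside the tag, nested tags included
full_text = [p.get_text() for p in paragraphs]

# find_all returns a list-like ResultSet, so read the text of each <a> individually
link_text = [a.get_text() for a in paragraphs[0].find_all("a")]

print(full_text)
print(link_text)
```

The key point is that .string and get_text() belong to individual tags, so the result of find_all has to be iterated over (or indexed) before the text is read.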
python web-scraping beautifulsoup
1 Answer (votes: 0)

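You can use a nested list comprehension to collect the text of every link inside each paragraph tag, and use encode("ascii", 'ignore') to convert the Unicode text to ASCII (characters that cannot be represented in ASCII are dropped):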

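# Python 3: urllib.urlopen (Python 2) is now urllib.request.urlopen
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
url = 'http://thehill.com/blogs/blog-briefing-room/365407-sean-diddy-combs-wants-to-buy-carolina-panthers-and-sign-kaepernick'
s = soup(urlopen(url).read(), 'lxml')  # the 'lxml' parser needs the lxml package installed
# Text of every <p>, including text inside nested tags; non-ASCII characters are dropped
all_text = [i.text.encode("ascii", 'ignore').decode("ascii") for i in s.find_all('p')]
# Link texts grouped by paragraph; filter(None, ...) discards paragraphs with no links
all_paragraphs = list(filter(None, [[b.text.encode("ascii", 'ignore').decode("ascii") for b in i.find_all('a')] for i in s.find_all('p')]))
print(all_text)
print(all_paragraphs)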

Output:

['Hip hop mogul Sean Diddy Combs said Sunday night hes interested in buying the Carolina Panthers and signing quarterback Colin Kaepernick, who has been unemployed this season after kneeling during the national anthem in 2016.', 'Panthers owner Jerry Richardson announced Sunday he would be selling the team after the 2017 season, just hours after Sports Illustrated published accusations of sexual misconduct from former employees. Richardson also allegedly used a racial slur about a team scout.', 'Diddy took to Twitter soon after the Panthers announced the upcoming sale, declaring his desire to own a team and increase diversity among NFL ownership.', 'I would like to buy the @Panthers. Spread the word. Retweet!', 'There are no majority African American NFL owners. Lets make history.', '', 'Kaepernick respondedSundaymorning, saying I want in on the ownership group!', 'I want in on the ownership group! Lets make it happen!', 'Other athletes, including NBA starStephen Curryandformer NFL playerGreg Jennings,responded to Combs saying they were interested in part-owning the team.', "Former league MVP Cam Newton is the team's current quarterback.", 'Kaepernick has been a free agent since the end of the 2016 season, when he made headlinesfor kneeling during the national anthem before games to protest issues of racial inequality.', 'President TrumpDonald John TrumpHouse Democrat slams Donald Trump Jr. for serious case of amnesia after testimony Skier Lindsey Vonn: I dont want to represent Trump at Olympics Poll: 4 in 10 Republicans think senior Trump advisers had improper dealings with Russia MORE hascriticized Kaepernick directly, saying the NFL should have suspended him for the demonstration. He has since taken aim at other players who have knelt or sat during the anthem during the 2017 season.', '- This story was updated at 11:03 A.M. EST.', 'View the discussion thread.', 'The Hill 1625 K Street, NW Suite 900 Washington DC 20006 | 202-628-8500 tel | 202-628-8503 fax', 'The contents of this site are 2017 Capitol Hill Publishing Corp., a subsidiary of News Communications, Inc.']
[['Sports Illustrated'], ['@Panthers'], ['Stephen Curry', 'former NFL player'], ['President Trump', 'Donald John Trump', 'House Democrat slams Donald Trump Jr. for serious case of amnesia after testimony', 'Skier Lindsey Vonn: I dont want to represent Trump at Olympics', 'Poll: 4 in 10 Republicans think senior Trump advisers had improper dealings with Russia', 'MORE', 'criticized Kaepernick directly', 'knelt or sat'], ['View the discussion thread.']]
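
The output above shows the cost of dropping every non-ASCII character: apostrophes vanish ("hes", "dont") and some words run together ("respondedSundaymorning"), which is consistent with curly quotes and non-breaking spaces being discarded, although that is an assumption about the page's markup. The following sketch, using only the standard library and a made-up sample string, compares that approach with one possible alternative that maps common typographic characters to ASCII equivalents instead. The sample text and the replacement table are illustrative, not taken from the page.

```python
import unicodedata

# Hypothetical sample resembling the scraped text:
# non-breaking spaces (\u00a0) and a curly apostrophe (\u2019)
sample = "Kaepernick responded\u00a0Sunday\u00a0morning, saying he\u2019s interested."

# Dropping non-ASCII characters outright, as encode("ascii", "ignore") does
dropped = sample.encode("ascii", "ignore").decode("ascii")

# Assumed alternative: NFKC normalization turns non-breaking spaces into
# ordinary spaces, then a small table maps curly quotes to straight ones
replacements = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})
cleaned = unicodedata.normalize("NFKC", sample).translate(replacements)

print(dropped)
print(cleaned)
```

In Python 3 every str is already Unicode, so this kind of cleanup is only needed when the result really has to be ASCII; otherwise the text returned by beautifulsoup can be used as-is.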