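I'm trying to write code that fetches and cleans the text of 100 websites every day. I've run into a problem with one website that has multiple h1 tags: when you scroll to the next h1 tag, the URL on the site changes, as it does for example on this website.
This is basically all I have.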
response=requests.get('https://economictimes.indiatimes.com/news/international/business/volkswagen-sets-5-7-revenue-growth-target-preaches-cost-discipline/articleshow/101168014.cms',headers={"User-Agent" : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"})
soup = BeautifulSoup(response.content, 'html.parser')
if len(soup.body.find_all('h1'))>2: #to check if there is more than one tag
    if i.endswith(".cms"): #to check if the website has .cms ending (i have my doubts on this part)
        for elem in soup.next_siblings:
            if elem.name == 'h1':
                # GET THE TEXT SOME HOW
                break
How do I get the text after the first h1 tag? (Note that the text is in [...] tags, not [...] tags.)
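The .next_siblings idea is the right one, but keep in mind that soup.next_siblings is unlikely to yield anything, because the document itself does not usually have any siblings.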
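To make the difference concrete, here is a minimal sketch on a tiny made-up HTML string (not the real page), assuming bs4 is installed. It compares the siblings of the document object with the siblings of the first h1:

```python
from bs4 import BeautifulSoup

# Tiny made-up page, used only for illustration.
html = "<html><body><div><h1>Title</h1><p>First</p><p>Second</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# The BeautifulSoup object is the root of the tree, so it has no siblings.
doc_siblings = list(soup.next_siblings)

# The h1 shares its parent with the two <p> tags that follow it.
h1_siblings = [sib.name for sib in soup.find("h1").next_siblings]

print(doc_siblings)  # []
print(h1_siblings)   # ['p', 'p']
```

So the loop has to start from the h1 tag (or one of its ancestors), not from soup.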
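The code below finds the first header, then [if it has no siblings of its own] searches its parents for the nearest one that does have siblings, and then iterates over those siblings, stopping if it reaches another h1 tag.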
# list_of_urls = ['https://economictimes.indiatimes.com/...
# for url in list_of_urls:
    # response = requests.get(url,.....
    # soup = BeautifulSoup(response.content, 'html.parser')
    header1 = soup.find('h1')
    if not header1:
        print(f'[{response.status_code} {response.reason}] No headers at', url)
        continue
    if header1.next_sibling: hSibs = header1.next_siblings
    else:
        hParent = next((p for p in header1.parents if p.next_sibling), None)
        hSibs = hParent.next_siblings if hParent else []
    h1Sibs = []
    for ns in hSibs:
        if ns.name == 'h1' or (not isinstance(ns,str) and ns.find('h1')): break
        h1Sibs.append(ns)
    h1Sibs_text = '\n---\n'.join(ns.get_text(' ') for ns in h1Sibs)
For the website in the example,
print(h1Sibs_text)
should print
SECTIONS Volkswagen sets 5-7% revenue growth target, preaches cost discipline Reuters Last Updated: Jun 21, 2023, 07:16 PM IST Rate Story Share Font Size Abc Small Abc Medium Abc Large Save Print Comment --- Synopsis The German carmaker has set "performance programmes" for each brand, allocating them capital and setting a specific return on sales target, but delegating responsibility to the brands for how those targets are reached, executives said in a press call on its Capital Markets Day. "If you look at how Volkswagen operated in the past, often we had a fixed cost growth and we wanted to outgrow that fixed cost," Chief Financial Officer Arno Antlitz said. Agencies Volkswagen sets 5-7% revenue growth target, preaches cost discipline Volkswagen set new financial targets on Wednesday of 5-7% annual revenue growth by 2027 and 9-11% returns by 2030, aiming to stay disciplined on investment and focus on boosting margins in the face of growing competition for market share. The German carmaker has set "performance programmes" for each brand, allocating them capital and setting a specific return on sales target, but delegating responsibility to the brands for how those targets are reached, executives said in a press call on its Capital Markets Day . "If you look at how Volkswagen operated in the past, often we had a fixed cost growth and we wanted to outgrow that fixed cost," Chief Financial Officer Arno Antlitz said. "We are convinced in the transformation we need to change that strategy to our value over volume approach, be very disciplined on fixed cost, be very disciplined on investment and rather focus on value," he added. In China , where internal combustion engine sales still provide high revenues for the carmaker, it has slightly reduced its target for battery-electric vehicle sales in the next 1-2 years and is instead focused on protecting margins, Antlitz said. 
The new revenue growth target is a marked jump from Volkswagen's performance in recent years, with revenue growing just 1.1-1.2% per year in the last two years, and 0.7% in 2018-2019 prior to the pandemic. Under the new performance programmes, each brand will have a set target for operating result, returns, net cash flow, cash conversion rate, and investment ratio, Volkswagen said in a statement, adding it would tie management incentives to meeting targets. The carmaker is planning separate capital markets days for each brand over the coming months to introduce those targets, sources close to the company told Reuters last Friday. Don’t miss out on ET Prime stories! Get your daily dose of business updates on WhatsApp. click here! Thursday, 22 Jun, 2023 Experience Your Economic Times Newspaper, The Digital Way! Read Complete Print Edition » Front Page Pure Politics ET Markets Smart Investing More Local Indices End at Record Peaks on HDFC, IT Gains India’s key stock benchmarks closed at record highs amid choppy trade on Wednesday, bucking the bearish mood in other Asian markets, as merger-bound HDFC Bank and HDFC, as well as software shares, paced the gains. Musk Meets Modi, Says Tesla to be in India Soon Tesla founder Elon Musk said he had a very good conversation with Prime Minister Narendra Modi and he is confident the company, the world’s largest electric carmaker, will be in India “as soon as humanly possible” and that it is likely to make a “significant investment” in the country. ZEE-Sony Deal on, Whether I’m CEO or Not Punit Goenka, CEO & MD of Zee Entertainment Enterprises, has said that the ZEE-Sony merger will go through whether or not he is the CEO of the merged company, as it benefits 96% of stakeholders. Read More News on volkswagen Capital Market Brands Capital Revenue antlitz arno antlitz capital markets day china (Catch all the Business News , Breaking News Events and Latest News Updates on The Economic Times .) 
Download The Economic Times News App to get Daily Market Updates & Live Business News. ... more less ETPrime stories of the day Venture capital Peak XV and Sequoia’s trek ahead has plenty of tricky troughs 11 mins read Investing Despite the INR1 lakh a share feat, MRF skids on 10 critical points investors shouldn’t overlook. 7 mins read OTT More than just claps and confetti: How IPL has transformed the in-stadium cricket viewing experience 14 mins read Subscribe to ETPrime Videos PM Modi, Joe Biden exchange gifts at White House PM signs on T-shirt of a boy as he welcomes Modi Details of PM Modi's gift to Joe Biden, Jill Biden Stock radar: Buy Grasim stock; target Rs 2080 Sensex loses over 50 pts, Nifty tests 18,850 Joe Biden, Jill Biden receive PM Modi at WH Here's what was there for PM at State Dinner Stock ideas by experts for June 22 Stocks in focus: Glenmark, LIC & more Richard Gere to Ruchira Kamboj on UN's Yoga event 1 2 3 Poll Are foreign rating agencies unfairly harsh on India? Yes No Can't say Vote Latest from ET Gritty Goenka says Sony-Zee merger is still on for larger audience Modi invites Micron to boost chip making in India Can Vedanta afford to repay debts amid profit pangs? Trending World News Nintendo Direct 2023 Pokemon Go Spotlight Hour Titanic tourist submersible Lionel Messi Britney Joy Venus Williams Boxing Summer solstice Jujutsu Kaisen Chapter 227 Wordle Today Quordle Today Mikayla Campinos Taylor Swift The Flash Box-office Plaza Wars Paxton Whitehead Summer Solstice 2023 Extraction 2 How to Watch Portugal vs Iceland Titanic Submarine
Note that you don't have to use '\n---\n' to join the siblings' texts; you can use any string as the separator.
By the way, for articles on that particular site, a simpler approach is to target the title tag specifically by its class:
if url.startswith('https://economictimes.indiatimes.com/'): ## might need more
    h1Sibs = soup.select('*:has(>h1.artTitle)~*')
    h1Sibs_text = '\n---\n'.join(ns.get_text(' ') for ns in h1Sibs)
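Using select with the *:has(>h1.artTitle)~* selector is similar to using soup.find('h1',class_='artTitle').parent.next_siblings, but it is safer than chaining find, parent, and next_siblings, because it just returns an empty list instead of raising an error if h1.artTitle is not found.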
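A minimal sketch of that difference, using two made-up HTML fragments rather than the real article page (bs4 with its bundled CSS selector support is assumed):

```python
from bs4 import BeautifulSoup

# Made-up fragments for illustration: one with h1.artTitle, one without.
with_title = BeautifulSoup(
    '<div><div><h1 class="artTitle">T</h1></div><p>A</p><p>B</p></div>',
    "html.parser",
)
without_title = BeautifulSoup("<div><h2>No title</h2><p>Body</p></div>", "html.parser")

selector = "*:has(>h1.artTitle)~*"

# With the title present, select() returns the siblings after the h1's parent.
found = [tag.name for tag in with_title.select(selector)]

# With no match, select() simply returns an empty result.
missing = without_title.select(selector)

# The chained version fails, because find() returns None when nothing matches.
try:
    without_title.find("h1", class_="artTitle").parent.next_siblings
    chained_error = None
except AttributeError as exc:
    chained_error = type(exc).__name__

print(found, len(missing), chained_error)
```

One difference worth knowing: select only returns tags, whereas next_siblings also yields bare text nodes that sit between them.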
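If you're scraping many different links but you know which sites most of them belong to, you might want to split each site (or even groups of sites) into if...elif... blocks, and only use something generic like my first snippet in the else block, for sites that aren't listed. You could even consider using something like this configurable parser, along with a set of selectors for each site.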
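One possible shape for the per-site setup is a lookup table instead of a long if...elif chain. This is only a sketch: SITE_SELECTORS, selector_for, siblings_after_title, and generic_fallback are hypothetical names, and only the economictimes selector comes from the snippet above.

```python
from urllib.parse import urlparse

# Hypothetical per-site configuration: host name -> CSS selector for the
# elements that follow the article title. Add more sites as you learn them.
SITE_SELECTORS = {
    "economictimes.indiatimes.com": "*:has(>h1.artTitle)~*",
}


def selector_for(url):
    """Return the selector configured for the URL's host, or None."""
    host = urlparse(url).netloc.lower()
    return SITE_SELECTORS.get(host)


def siblings_after_title(soup, url, generic_fallback):
    """Use the site-specific selector if one is configured; otherwise call
    generic_fallback(soup), e.g. a function wrapping the generic snippet."""
    selector = selector_for(url)
    if selector is None:
        return generic_fallback(soup)
    return soup.select(selector)


print(selector_for("https://economictimes.indiatimes.com/news/x.cms"))
print(selector_for("https://example.com/article"))
```

Matching on the exact host is a deliberate simplification; sites that serve articles from several subdomains would need a looser match.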
response = requests.get('https://economictimes.indiatimes.com/news/international/business/volkswagen-sets-5-7-revenue-growth-target-preaches-cost-discipline/articleshow/101168014.cms', headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"})
soup = BeautifulSoup(response.content, 'html.parser')
h1_tags = soup.body.find_all('h1')
text_after_h1 = None  # stays None if no <p> sibling follows the first h1
if len(h1_tags) > 1:
    for sibling in h1_tags[0].next_siblings:
        if sibling.name == 'p':
            text_after_h1 = sibling.get_text(strip=True)
            break
print(text_after_h1)
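- soup.body.find_all('h1') finds all the h1 elements in the body.
- We iterate over the first h1's next siblings until we find a p tag.
- get_text() grabs the text under the p tag.
- strip=True removes any leading or trailing whitespace.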
I ran into a similar problem once. Hope this helps.