Trying to do web scraping in Python


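I am trying to extract a stream of URLs, but I am stuck. The URL I start from leads to a page that contains another link, and that link has to be clicked to reach the URL of the page I actually want to scrape.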

This is the source URL: http://ndl.iitkgp.ac.in/nw_document/economictimes/IN__T_E_T___2__1_2001__3_12__65_05__23_71

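Clicking the link on the source page opens the page I actually want. I need to write a method that, given only the source URL, extracts the final URL that the button leads to.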

I am using the selenium (WebDriver) and beautifulsoup packages.

Here is a piece of my code. It fails to find the specific link I want:

soup = BeautifulSoup(driver.page_source, 'html.parser')

# Find the final URL based on the given HTML structure
final_link = soup.find('a', class_='btn btn-success', href=True)

if final_link:
    final_url = final_link['href']
    print(f"Found final URL: {final_url}")
    posts_urls.append(final_url)
else:
    print("No final URL found in the expected format.")

Tags: python, selenium-webdriver, web-scraping, beautifulsoup

1 Answer

The link you are looking for is inside an <iframe>, so you can request the iframe's URL directly. Try:

import requests
from bs4 import BeautifulSoup

id_ = "IN__T_E_T___2__1_2001__3_12__65_05__23_71"

# URL of the <iframe> that holds the actual content
viewer_url = "https://ndl.iitkgp.ac.in/module-viewer/viewer.php?id=economictimes/economictimes/{}&domain=nw"


def get_page(id_):
    headers = {"Referer": f"http://ndl.iitkgp.ac.in/nw_document/economictimes/{id_}"}
    soup = BeautifulSoup(
        requests.get(viewer_url.format(id_), headers=headers).content, "html.parser"
    )
    return soup


def get_sub_ids(id_):
    soup = get_page(id_)
    return [a["href"].split("/")[-1] for a in soup.select("a")]


sub_ids = get_sub_ids(id_)  # sub_ids = ['37230_2', '37230_3', '37230_4', '37230_1']
for s in sub_ids:
    url = get_page(s).select_one(".btn")["href"]
    print(s, url)

This prints:

37230_2 https://economictimes.indiatimes.com/news/economy/policy/yield-spread-crashes-prophesies-economy-gloom/articleshow/332850363.cms
37230_3 https://economictimes.indiatimes.com/news/economy/policy/supplementary-demands-approved/articleshow/1219281788.cms
37230_4 https://economictimes.indiatimes.com/news/economy/policy/shourie-seeks-ags-advice-on-itdc-hotel-sale/articleshow/1436743634.cms
37230_1 https://economictimes.indiatimes.com/news/economy/policy/govt-hints-at-removing-poto-provisions-affecting-media/articleshow/882241993.cms
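
The code above hardcodes the document id, while the question asks for a method that starts from the source URL alone. A minimal sketch of that step follows. It assumes the id is always the last path segment of the source URL and that the viewer URL pattern from the answer also holds for other documents. `build_viewer_request` is a hypothetical helper name, not part of any library. It makes no network request; it only prepares the URL and `Referer` header that `requests.get` would need.

```python
from urllib.parse import urlsplit

# Viewer URL pattern copied from the answer above; assumed (not verified)
# to apply to other economictimes documents as well.
VIEWER_URL = (
    "https://ndl.iitkgp.ac.in/module-viewer/viewer.php"
    "?id=economictimes/economictimes/{}&domain=nw"
)


def build_viewer_request(source_url):
    """Return (viewer_url, headers) for a source URL (hypothetical helper)."""
    # Assumption: the document id is the last path segment of the source URL.
    doc_id = urlsplit(source_url).path.rstrip("/").split("/")[-1]
    # The answer sends the source page as the Referer, so do the same here.
    headers = {"Referer": source_url}
    return VIEWER_URL.format(doc_id), headers


source = (
    "http://ndl.iitkgp.ac.in/nw_document/economictimes/"
    "IN__T_E_T___2__1_2001__3_12__65_05__23_71"
)
viewer, headers = build_viewer_request(source)
print(viewer)
print(headers["Referer"])
```

The returned pair can be passed on as `requests.get(viewer, headers=headers)` and parsed as in `get_page` above.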