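I am trying to follow a chain of URLs to extract data, but I am stuck. The URL I start from redirects to a page that contains a link, and that link has to be clicked to reach the page URL I actually want to scrape.
This is the source URL: http://ndl.iitkgp.ac.in/nw_document/economictimes/IN__T_E_T___2__1_2001__3_12__65_05__23_71
I need to write a method that, given only the source URL, extracts the final URL that clicking the button leads to.
I am using the Selenium, WebDriver, and Beautiful Soup packages.
Here is a piece of my code. It cannot find the specific link I want: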
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Find the final URL based on the given HTML structure
final_link = soup.find('a', class_='btn btn-success', href=True)
if final_link:
    final_url = final_link['href']
    print(f"Found final URL: {final_url}")
    posts_urls.append(final_url)
else:
    print("No final URL found in the expected format.")
Try:
import requests
from bs4 import BeautifulSoup

id_ = "IN__T_E_T___2__1_2001__3_12__65_05__23_71"

# this is the <iframe> URL that has the actual content
viewer_url = "https://ndl.iitkgp.ac.in/module-viewer/viewer.php?id=economictimes/economictimes/{}&domain=nw"


def get_page(id_):
    # the request sends the matching document page as the Referer
    headers = {"Referer": f"http://ndl.iitkgp.ac.in/nw_document/economictimes/{id_}"}
    soup = BeautifulSoup(
        requests.get(viewer_url.format(id_), headers=headers).content, "html.parser"
    )
    return soup


def get_sub_ids(id_):
    soup = get_page(id_)
    # the last path segment of each link is the id of a sub-page
    return [a["href"].split("/")[-1] for a in soup.select("a")]


sub_ids = get_sub_ids(id_)  # sub_ids = ['37230_2', '37230_3', '37230_4', '37230_1']

for s in sub_ids:
    # each sub-page has a button (.btn) whose href is the final article URL
    url = get_page(s).select_one(".btn")["href"]
    print(s, url)
Prints:
37230_2 https://economictimes.indiatimes.com/news/economy/policy/yield-spread-crashes-prophesies-economy-gloom/articleshow/332850363.cms
37230_3 https://economictimes.indiatimes.com/news/economy/policy/supplementary-demands-approved/articleshow/1219281788.cms
37230_4 https://economictimes.indiatimes.com/news/economy/policy/shourie-seeks-ags-advice-on-itdc-hotel-sale/articleshow/1436743634.cms
37230_1 https://economictimes.indiatimes.com/news/economy/policy/govt-hints-at-removing-poto-provisions-affecting-media/articleshow/882241993.cms
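Since the question asks for something that starts from the source URL alone, while the snippet above hard-codes `id_`, here is a minimal sketch of deriving the id and the request parameters from the source URL. It assumes the document id is always the last path segment of the source URL and that the viewer URL template shown above applies, which has only been seen for this one Economic Times example. `doc_id_from_source` and `viewer_request` are hypothetical helper names, not part of any library.

```python
from urllib.parse import urlparse

# Template copied from the snippet above; the "economictimes" parts are
# assumed to hold only for this collection.
VIEWER_URL = "https://ndl.iitkgp.ac.in/module-viewer/viewer.php?id=economictimes/economictimes/{}&domain=nw"


def doc_id_from_source(source_url):
    """Return the last non-empty path segment (assumed to be the document id)."""
    segments = [s for s in urlparse(source_url).path.split("/") if s]
    if not segments:
        raise ValueError(f"no path segments in {source_url!r}")
    return segments[-1]


def viewer_request(source_url):
    """Build the viewer URL and the Referer header for a source URL."""
    doc_id = doc_id_from_source(source_url)
    # The source URL itself is used as the Referer, as in get_page() above.
    headers = {"Referer": source_url}
    return VIEWER_URL.format(doc_id), headers


source = "http://ndl.iitkgp.ac.in/nw_document/economictimes/IN__T_E_T___2__1_2001__3_12__65_05__23_71"
url, headers = viewer_request(source)
print(url)
print(headers)
```

The returned pair can be passed straight to `requests.get(url, headers=headers)` in place of the hard-coded values in `get_page`.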