I'm running into a problem scraping PDFs from the MDPI Remote Sensing journal with BeautifulSoup and Python.
The goal of my code is to crawl every volume of the journal and every issue within it, and download the PDFs to my local machine. Each volume contains multiple issues, and each issue contains multiple articles. Each article's PDF link sits in an element with class_="UD_Listings_ArticlePDF".
My problem is that the code downloads at most 30 articles per issue per volume, even though most issues contain more than 30 articles (for example, Volume 8, Issue 2 has well over 30). I don't understand why this happens, because the class_="UD_Listings_ArticlePDF" elements are visible in the page's source HTML and the code should pick them up.
Can anyone help me figure out what is going on? (Code attached below.)
import requests
from bs4 import BeautifulSoup
import os
import time

# Base URL for the journal (change if the base URL pattern changes)
base_url = "https://www.mdpi.com/2072-4292"

# Directory to save the PDFs
os.makedirs("mdpi_pdfs", exist_ok=True)

# Define the range of volumes and issues to scrape
start_volume = 1
end_volume = 16  # Change this number based on the latest volume available

# Time delay between requests in seconds
request_delay = 4  # Time delay between requests to avoid 429 errors

# Maximum number of retries after 429 errors
max_retries = 5

# Iterate over each volume
for volume_num in range(start_volume, end_volume + 1):
    print(f"\nProcessing Volume {volume_num}...")

    # Assume a reasonable number of issues per volume
    start_issue = 1
    end_issue = 30  # You may need to adjust this based on the number of issues per volume

    for issue_num in range(start_issue, end_issue + 1):
        issue_url = f"{base_url}/{volume_num}/{issue_num}"
        print(f"  Processing Issue URL: {issue_url}")

        retries = 0
        while retries < max_retries:
            try:
                # Get the content of the issue webpage
                response = requests.get(issue_url)

                # If issue URL doesn't exist, break the loop for this volume
                if response.status_code == 404:
                    print(f"  Issue {issue_num} in Volume {volume_num} does not exist. Moving to next volume.")
                    time.sleep(request_delay * 5)
                    break

                # Handle 429 errors gracefully
                if response.status_code == 429:
                    print(f"  Received 429 error. Too many requests. Retrying in {request_delay * 5} seconds...")
                    retries += 1
                    time.sleep(request_delay * 5)  # Exponential backoff strategy
                    continue

                response.raise_for_status()  # Check for other request errors

                # Parse the page content
                soup = BeautifulSoup(response.content, "html.parser")

                # Find all links that lead to PDFs
                pdf_links = soup.find_all("a", class_="UD_Listings_ArticlePDF")  # Adjust class if needed

                if not pdf_links:
                    print(f"  No PDF links found for Issue {issue_num} in Volume {volume_num}.")
                    break

                # Download each PDF for the current issue
                for index, link in enumerate(pdf_links, start=1):
                    try:
                        # Construct the full URL for the PDF
                        pdf_url = f"https://www.mdpi.com{link['href']}"

                        # Create a unique file name with volume and issue information
                        pdf_name = f"mdpi_volume_{volume_num}_issue_{issue_num}_article_{index}.pdf"
                        pdf_path = os.path.join("mdpi_pdfs", pdf_name)

                        print(f"    Downloading: {pdf_url}")

                        # Download the PDF
                        pdf_response = requests.get(pdf_url)
                        pdf_response.raise_for_status()  # Check for request errors

                        # Save the PDF file
                        with open(pdf_path, "wb") as file:
                            file.write(pdf_response.content)

                        print(f"    Successfully downloaded: {pdf_name}")

                        # Sleep after each successful download
                        time.sleep(request_delay)
                    except Exception as e:
                        print(f"    Failed to download {pdf_url}. Error: {e}")

                # Exit the retry loop since request was successful
                break
            except Exception as e:
                print(f"  Failed to process Issue {issue_num} in Volume {volume_num}. Error: {e}")
                retries += 1
                if retries < max_retries:
                    print(f"  Retrying in {request_delay * 2} seconds... (Retry {retries}/{max_retries})")
                    time.sleep(request_delay * 2)
                else:
                    print(f"  Maximum retries reached. Skipping Issue {issue_num} in Volume {volume_num}.")

print("\nDownload process completed for all specified volumes and issues!")
I have also tried broader selectors to catch any oddly formatted class names, but each issue still returns only 30 PDF links even though there are more on the page.
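The looser matching I tried was along these lines (just a sketch of the idea; the substring selector is only an attempt to catch class-name variants, not something MDPI's markup requires):

    # Try to catch any class-name variants containing "ArticlePDF"
    pdf_links = soup.select('a[class*="ArticlePDF"]')
    print(len(pdf_links))  # still 30 per issue page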
Requesting the issue page with python and downloading the articles only gets you the first 30 articles.
For the rest, you need to make additional requests in this format: https://www.mdpi.com/2072-4292/volume_number/issue_number/date/default/30/15 where 30 is the starting offset and 15 is the count. So the next URLs need to be:
https://www.mdpi.com/2072-4292/volume_number/issue_number/date/default/45/15
https://www.mdpi.com/2072-4292/volume_number/issue_number/date/default/60/15
https://www.mdpi.com/2072-4292/volume_number/issue_number/date/default/75/15
and so on, until a request comes back with no items of class "UD_Listings_ArticlePDF".
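A minimal sketch of that pagination loop, assuming the /date/default/{offset}/15 URL pattern described above and the same UD_Listings_ArticlePDF class (get_all_pdf_links is a hypothetical helper name; volume_num and issue_num are placeholders):

    import requests
    from bs4 import BeautifulSoup
    import time

    def get_all_pdf_links(volume_num, issue_num, request_delay=4):
        """Collect PDF hrefs for one issue, following the offset-based pagination above."""
        base = "https://www.mdpi.com/2072-4292"
        all_links = []

        # First page: the plain issue URL returns the first 30 articles
        first_url = f"{base}/{volume_num}/{issue_num}"
        soup = BeautifulSoup(requests.get(first_url).content, "html.parser")
        all_links += [a["href"] for a in soup.find_all("a", class_="UD_Listings_ArticlePDF")]

        # Remaining pages: /date/default/{offset}/15, stepping the offset by 15
        offset = 30
        while True:
            time.sleep(request_delay)  # be polite, avoid 429s
            page_url = f"{base}/{volume_num}/{issue_num}/date/default/{offset}/15"
            soup = BeautifulSoup(requests.get(page_url).content, "html.parser")
            links = soup.find_all("a", class_="UD_Listings_ArticlePDF")
            if not links:  # an empty page means we've walked past the last article
                break
            all_links += [a["href"] for a in links]
            offset += 15

        return all_links

You can then feed the collected hrefs into the download loop you already have, prefixing them with https://www.mdpi.com as before.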