Data scraping problem when downloading PDFs from a journal website


I'm having trouble scraping PDFs from the MDPI Remote Sensing journal using Python and BeautifulSoup.

The goal of my code is to crawl every volume of the journal and the issues within it, and download the article PDFs to my local machine. Each volume of the journal contains several issues, and each issue contains several articles. Each article's PDF link sits in an element with class_="UD_Listings_ArticlePDF".

My problem is that my code downloads at most 30 articles per issue per volume, even though most issues contain more than 30 articles (for example, Volume 8, Issue 2 has well over 30). I don't understand why this happens, because the class_="UD_Listings_ArticlePDF" elements are visible in the source HTML and the code should detect them.

Can anyone help me figure out what's going on? (See the attached code.)

import requests
from bs4 import BeautifulSoup
import os
import time

# Base URL for the journal (change if the base URL pattern changes)
base_url = "https://www.mdpi.com/2072-4292"

# Directory to save the PDFs
os.makedirs("mdpi_pdfs", exist_ok=True)

# Define the range of volumes and issues to scrape
start_volume = 1
end_volume = 16  # Change this number based on the latest volume available

# Time delay between requests in seconds
request_delay = 4  # Time delay between requests to avoid 429 errors

# Maximum number of retries after 429 errors
max_retries = 5

# Iterate over each volume
for volume_num in range(start_volume, end_volume + 1):
    print(f"\nProcessing Volume {volume_num}...")

    # Assume a reasonable number of issues per volume
    start_issue = 1
    end_issue = 30  # You may need to adjust this based on the number of issues per volume

    for issue_num in range(start_issue, end_issue + 1):
        issue_url = f"{base_url}/{volume_num}/{issue_num}"
        print(f"  Processing Issue URL: {issue_url}")

        retries = 0
        while retries < max_retries:

            try:
                # Get the content of the issue webpage
                response = requests.get(issue_url)

                # If issue URL doesn't exist, break the loop for this volume
                if response.status_code == 404:
                    print(f"  Issue {issue_num} in Volume {volume_num} does not exist. Moving to next volume.")
                    time.sleep(request_delay * 5)
                    break

                # Handle 429 errors gracefully
                if response.status_code == 429:
                    print(f"  Received 429 error. Too many requests. Retrying in {request_delay * 5} seconds...")
                    retries += 1
                    time.sleep(request_delay * 5)  # Fixed back-off delay before retrying
                    continue

                response.raise_for_status()  # Check for other request errors

                # Parse the page content
                soup = BeautifulSoup(response.content, "html.parser")

                # Find all links that lead to PDFs
                pdf_links = soup.find_all("a", class_="UD_Listings_ArticlePDF")  # Adjust class if needed

                if not pdf_links:
                    print(f"  No PDF links found for Issue {issue_num} in Volume {volume_num}.")
                    break

                # Download each PDF for the current issue
                for index, link in enumerate(pdf_links, start=1):
                    try:
                        # Construct the full URL for the PDF
                        pdf_url = f"https://www.mdpi.com{link['href']}"

                        # Create a unique file name with volume and issue information
                        pdf_name = f"mdpi_volume_{volume_num}_issue_{issue_num}_article_{index}.pdf"
                        pdf_path = os.path.join("mdpi_pdfs", pdf_name)

                        print(f"    Downloading: {pdf_url}")

                        # Download the PDF
                        pdf_response = requests.get(pdf_url)
                        pdf_response.raise_for_status()  # Check for request errors

                        # Save the PDF file
                        with open(pdf_path, "wb") as file:
                            file.write(pdf_response.content)

                        print(f"    Successfully downloaded: {pdf_name}")

                        # Sleep after each successful download
                        time.sleep(request_delay)

                    except Exception as e:
                        print(f"    Failed to download {pdf_url}. Error: {e}")

                # Exit the retry loop since request was successful
                break

            except Exception as e:
                print(f"  Failed to process Issue {issue_num} in Volume {volume_num}. Error: {e}")
                retries += 1
                if retries < max_retries:
                    print(f"  Retrying in {request_delay * 2} seconds... (Retry {retries}/{max_retries})")
                    time.sleep(request_delay * 2)
                else:
                    print(f"  Maximum retries reached. Skipping Issue {issue_num} in Volume {volume_num}.")

print("\nDownload process completed for all specified volumes and issues!")

I tried using broader selectors to catch any oddly formatted classes, but each issue still returns only 30 PDF links even though more articles exist.
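For reference, a broader selector along these lines (a minimal sketch; the attribute-substring match is only an illustration of the kind of selector tried, not the exact one used) still comes back with at most 30 links per issue page:

# Hypothetical broader selector: match any <a> whose class attribute contains "ArticlePDF"
pdf_links = soup.select('a[class*="ArticlePDF"]')
print(len(pdf_links))  # still reports at most 30 per issue page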

python web-scraping beautifulsoup pypdf
1 Answer

Requesting the issue page with Python and downloading the articles only gets you the first 30 articles.

For the rest, you need to make additional requests in this format: https://www.mdpi.com/2072-4292/volume_number/issue_number/date/default/30/15, where 30 is the starting offset and 15 is the count. So the next URLs need to be:

https://www.mdpi.com/2072-4292/volume_number/issue_number/date/default/45/15
https://www.mdpi.com/2072-4292/volume_number/issue_number/date/default/60/15
https://www.mdpi.com/2072-4292/volume_number/issue_number/date/default/75/15

until you reach a page with no items of class "UD_Listings_ArticlePDF".
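A minimal sketch of that pagination loop, based on the URL pattern above (the /date/default/<offset>/<count> path segments and the page size of 15 are taken from this answer; a real run would still need the retry/back-off and rate-limiting handling from the question's code):

import requests
from bs4 import BeautifulSoup

def collect_issue_pdf_links(base_url, volume_num, issue_num, page_size=15):
    """Collect every PDF href for one issue, following the offset-based pagination."""
    pdf_hrefs = []

    # First page: the plain issue URL returns the first 30 articles.
    first = requests.get(f"{base_url}/{volume_num}/{issue_num}")
    first.raise_for_status()
    soup = BeautifulSoup(first.content, "html.parser")
    pdf_hrefs += [a["href"] for a in soup.find_all("a", class_="UD_Listings_ArticlePDF")]

    # Remaining pages: .../date/default/<offset>/<count>, starting at offset 30,
    # until a page comes back with no UD_Listings_ArticlePDF links.
    offset = 30
    while True:
        page_url = f"{base_url}/{volume_num}/{issue_num}/date/default/{offset}/{page_size}"
        resp = requests.get(page_url)
        if resp.status_code == 404:
            break
        resp.raise_for_status()
        soup = BeautifulSoup(resp.content, "html.parser")
        links = soup.find_all("a", class_="UD_Listings_ArticlePDF")
        if not links:
            break  # empty page: no more articles in this issue
        pdf_hrefs += [a["href"] for a in links]
        offset += page_size

    return pdf_hrefs

# Example (hypothetical usage):
# hrefs = collect_issue_pdf_links("https://www.mdpi.com/2072-4292", 8, 2)
# print(len(hrefs))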
