Python web scraping - bulk downloading linked files from the SEC AAER site, 403 Forbidden error


I have been trying to download 300 linked files from the SEC's AAER site. Most of the links are PDFs, but some are web pages that I need to save as PDFs rather than simply download. I am teaching myself some Python web scraping, and this did not seem like too hard a task, but I keep running into a 403 error when downloading.

This code works fine for scraping the links to the files, along with the 4-digit code I want to use to name each file:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os
import requests

# Set up Chrome options to allow direct PDF download (for the download step)
download_path = "C:/Users/taylo/Downloads/sec_aaer_downloads"
chrome_options = Options()
chrome_options.add_experimental_option("prefs", {
    "download.default_directory": download_path,  # Specify your preferred download directory
    "download.prompt_for_download": False,  # Disable download prompt
    "plugins.always_open_pdf_externally": True,  # Automatically open PDF in browser
    "safebrowsing.enabled": False,  # Disable Chrome’s safe browsing check that can block downloads
    "profile.default_content_settings.popups": 0  # Disable popups
})

# Set up the webdriver with options
driver = webdriver.Chrome(executable_path="C:/chromedriver/chromedriver", options=chrome_options)

# URLs for pages 1, 2, and 3
urls = [
    "https://www.sec.gov/enforcement-litigation/accounting-auditing-enforcement-releases?page=0",
    "https://www.sec.gov/enforcement-litigation/accounting-auditing-enforcement-releases?page=1",
    "https://www.sec.gov/enforcement-litigation/accounting-auditing-enforcement-releases?page=2"
]

# Initialize an empty list to store the URLs and AAER numbers
pdf_data = []

# Loop through each URL (pages 1, 2, and 3)
for url in urls:
    print(f"Scraping URL: {url}...")
    driver.get(url)

    # Wait for the table rows containing links to be loaded
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="block-uswds-sec-content"]/div/div/div[3]/div/table/tbody/tr[1]')))
    
    # Extract the link and AAER number from each row on the current page
    rows = driver.find_elements(By.XPATH, '//*[@id="block-uswds-sec-content"]/div/div/div[3]/div/table/tbody/tr')
    for row in rows:
        try:
            # Extract the link from the first column (PDF link)
            link_element = row.find_element(By.XPATH, './/td[2]/div[1]/a')
            link_href = link_element.get_attribute('href')
            
            # Extract the AAER number from the second column
            aaer_text_element = row.find_element(By.XPATH, './/td[2]/div[2]/span[2]')
            aaer_text = aaer_text_element.text
            aaer_number = aaer_text.split("AAER-")[1].split()[0]  # Extract the number after AAER-

            # Store the data in a list of dictionaries
            pdf_data.append({'link': link_href, 'aaer_number': aaer_number})
        except Exception as e:
            print(f"Error extracting data from row: {e}")

# Print the scraped data (optional for verification)
for entry in pdf_data:
    print(f"Link: {entry['link']}, AAER Number: {entry['aaer_number']}")

But when I try something like the following, I cannot get the downloads to work:

import os
import time
import requests

# Set the download path
download_path = "C:/Users/taylo/Downloads/sec_aaer_downloads"
os.makedirs(download_path, exist_ok=True)

# Loop through each entry in the pdf_data list
for entry in pdf_data:
    try:
        # Extract the PDF link and AAER number
        link_href = entry['link']
        aaer_number = entry['aaer_number']

        # Send a GET request to download the PDF
        pdf_response = requests.get(link_href, stream=True, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
        })

        # Check if the request was successful
        if pdf_response.status_code == 200:
            # Save the PDF to the download folder, using the AAER number as the filename
            pdf_file_path = os.path.join(download_path, f"{aaer_number}.pdf")
            with open(pdf_file_path, "wb") as pdf_file:
                for chunk in pdf_response.iter_content(chunk_size=8192):
                    pdf_file.write(chunk)
            print(f"Downloaded: {aaer_number}.pdf")
        else:
            print(f"Failed to download the file from {link_href}, status code: {pdf_response.status_code}")
    
    except Exception as e:
        print(f"Error downloading the PDF for AAER {aaer_number}: {e}")

At this point it would be faster to download the files manually, but I would like to understand what I am doing wrong. I have tried setting a User-Agent header and simulating user clicks with Selenium. Thanks for any suggestions!

python selenium-webdriver pdf web-scraping http-status-code-403
1 Answer

After copying all of the headers from the request headers sent when you open a link to a PDF manually, I was able to download the file:

[Screenshot: the request headers shown in the browser's developer tools]

You will also need to remove the stream=True argument from the requests call.

This is why you are getting Status Code 403 Forbidden: you need all of the headers to be able to access the URL.
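
Putting those two points together, a minimal sketch of the download loop might look like the following. The header values below are only placeholders; copy the real ones your own browser sends from the DevTools Network tab. The SEC's guidance for automated access to its site also asks clients to identify themselves, so a User-Agent that includes contact details can help.

import os
import requests

download_path = "C:/Users/taylo/Downloads/sec_aaer_downloads"
os.makedirs(download_path, exist_ok=True)

# Example header set copied from the browser's "Request Headers" panel for a
# manual PDF download -- replace these with the values your own browser sends.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Accept": "application/pdf,text/html,application/xhtml+xml,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Host": "www.sec.gov",
    "Referer": "https://www.sec.gov/enforcement-litigation/accounting-auditing-enforcement-releases",
    "Connection": "keep-alive",
}

for entry in pdf_data:
    link_href = entry['link']
    aaer_number = entry['aaer_number']

    # Note: no stream=True here, per the point above.
    pdf_response = requests.get(link_href, headers=headers)

    if pdf_response.status_code == 200:
        # Save the PDF using the AAER number as the filename
        pdf_file_path = os.path.join(download_path, f"{aaer_number}.pdf")
        with open(pdf_file_path, "wb") as pdf_file:
            pdf_file.write(pdf_response.content)
        print(f"Downloaded: {aaer_number}.pdf")
    else:
        print(f"Failed to download {link_href}, status code: {pdf_response.status_code}")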

Hope this helps!
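
As a postscript on the part of the question about links that are regular web pages rather than PDFs: since a Chrome session is already running, one option is to have Chrome itself render the page to PDF through the DevTools protocol. This is only a sketch under that assumption (the driver from the question is still open, and save_page_as_pdf is a helper name made up here):

import base64
import os

def save_page_as_pdf(driver, url, out_path):
    # Load the page in the existing Selenium session, then ask Chrome to
    # render it to PDF via the DevTools "Page.printToPDF" command.
    driver.get(url)
    result = driver.execute_cdp_cmd("Page.printToPDF", {"printBackground": True})
    # The command returns the PDF as a base64-encoded string under "data".
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(result["data"]))

# Example usage with one of the scraped entries:
# save_page_as_pdf(driver, entry['link'],
#                  os.path.join(download_path, f"{entry['aaer_number']}.pdf"))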
