Scraping an infinite-scroll page with Firefox Selenium fails with an error, possibly because of too much data


I am trying to scrape this Meetup page, which uses infinite scroll, to get the list of past events. I want the event list with name, date and URL (mostly just the name; the other two are optional).

My code works if I limit the scrolling to 10 or 20 iterations, but if I let it run all the way to the end I get an error (see below).

The full code is pasted below as well.

I have been going back and forth with ChatGPT for several days without much luck. The error appears to be caused by too much data being fed into Selenium.

Is there anything I can do to make this work?

Thanks in advance.

The error message (sorry about the formatting):

File "C:\Users\USER\Desktop\meetup.py", line 52, in <module>
page_source = driver.page_source
^^^^^^^^^^^^^^^^^^
File "C:\Users\USER\AppData\Local\Programs\Python\Python312\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 455, in page_source
return self.execute(Command.GET_PAGE_SOURCE)["value"]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USER\AppData\Local\Programs\Python\Python312\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 354, in execute
self.error_handler.check_response(response)
File "C:\Users\USER\AppData\Local\Programs\Python\Python312\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 229, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidArgumentException: Message: unexpected end of hex escape at line 1 column 7937369
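
The exception is raised while Selenium parses the WebDriver response that carries the whole page source back from geckodriver; the column number in the message (7,937,369) suggests the document has grown to several megabytes of HTML by that point, and a truncated hex escape somewhere in that payload appears to trip the parser. As a quick check of the "too much data" theory, the size of the rendered document can be read without transferring it; a small diagnostic sketch (not part of the original script):

# Ask Firefox for the size of the rendered document without transferring it.
# Only a single integer crosses the WebDriver connection, so this call cannot
# hit the same parsing problem as driver.page_source.
html_length = driver.execute_script(
    "return document.documentElement.outerHTML.length;"
)
print(f"Rendered HTML is {html_length:,} characters long")
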

My code:

from selenium import webdriver
from selenium.webdriver.firefox.service import Service as FirefoxService
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

# Path to your GeckoDriver
GECKODRIVER_PATH = 'C:\\Program Files\\GeckoDriver\\geckodriver.exe'

# Setup Firefox options
firefox_options = Options()
firefox_options.add_argument("--headless")  # Run in headless mode (no UI)
firefox_options.set_preference("general.useragent.override", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3")
firefox_options.set_preference('permissions.default.stylesheet', 2)  # Disable CSS
firefox_options.set_preference("permissions.default.image", 2)  # Disable images

# Initialize the WebDriver
service = FirefoxService(executable_path=GECKODRIVER_PATH)
driver = webdriver.Firefox(service=service, options=firefox_options)

# Load the page
url = 'https://www.meetup.com/meetup-group-philosophy101/events/?type=past'
driver.get(url)

# Wait for the page to load and start infinite scrolling
wait = WebDriverWait(driver, 1)

# Function to scroll down
def scroll_page(driver, wait, pause_time=1):
    last_height = driver.execute_script("return document.body.scrollHeight")
    j = 0
    while j < 5:
        # Scroll down to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight-1200);")
        time.sleep(pause_time)

        # Check if new content has been loaded
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            j += 1
            time.sleep(3)
        else:
            j = 0
            last_height = new_height

# Scroll to the bottom to load all events
scroll_page(driver, wait)
print("End of infinite scroll")

# Save HTML file locally
page_source = driver.page_source # the error starts here, BUT even if I don't save html file locally and skip to the next section, I still get an error with "driver.page_source"
html_file_path = 'C:\\meetup.html'
with open(html_file_path, 'w', encoding='utf-8') as file:
    file.write(page_source)

# Parse the page source with BeautifulSoup lxml
soup = BeautifulSoup(driver.page_source, 'lxml')

# Debugging: Check if the page source was retrieved
print("Page source retrieved.")

# Extract event details
events = []
event_cards = soup.find_all('div', class_='rounded-md bg-white p-4 shadow-sm sm:p-5')

# Debugging: Check if event cards were found
print(f"Found {len(event_cards)} event cards.")

for card in event_cards:
    title = card.find('span').get_text(strip=True) \
        if card.find('span') else 'Title not found'
    date = card.find('time').get_text(strip=True) if card.find('time') else 'Date not found'
    link = card.find('a')
    eventurl = link['href']
    events.append({'title': title, 'date': date, 'eventurl': eventurl})

# Print or save the events
file_path = 'C:\\meetup.txt'

if events:
    with open(file_path, 'w', encoding='utf-8') as file:
        for event in events:
            # Format the string
            formatted_text = f"Title: {event['title']}, Date: {event['date']}, URL: {event['eventurl']}\n"
            # Write the formatted text to the file
            file.write(formatted_text)
    print("write complete")
else:
    print("No events found.")

# Close the WebDriver
driver.quit()
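
One way to sidestep the problem entirely is to avoid pulling the whole document through driver.page_source and instead read the event cards element by element with Selenium, so each WebDriver response stays small. A minimal sketch, assuming the card class string used in the script above is still what Meetup serves; it would replace the page_source / BeautifulSoup section:

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# Locate the cards by their exact class attribute (taken from the script above);
# an attribute selector avoids having to escape the "sm:p-5" class in CSS.
cards = driver.find_elements(
    By.CSS_SELECTOR, "div[class='rounded-md bg-white p-4 shadow-sm sm:p-5']"
)

events = []
for card in cards:
    try:
        title = card.find_element(By.TAG_NAME, "span").text.strip()
    except NoSuchElementException:
        title = 'Title not found'
    try:
        date = card.find_element(By.TAG_NAME, "time").text.strip()
    except NoSuchElementException:
        date = 'Date not found'
    try:
        eventurl = card.find_element(By.TAG_NAME, "a").get_attribute("href")
    except NoSuchElementException:
        eventurl = 'URL not found'
    events.append({'title': title, 'date': date, 'eventurl': eventurl})

print(f"Collected {len(events)} events without calling driver.page_source")
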
Tags: python, selenium-webdriver, beautifulsoup
1 Answer

Could you try adding a sleep before the

page_source = driver.page_source

line? I think that is the cause of the problem. Either a plain sleep or something like the following:

WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "body")))
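
Spelled out, the suggestion would look roughly like the sketch below: let the page settle after the last scroll, wait for the body element, then retry driver.page_source a couple of times if the first transfer still fails. This assumes the imports already present in the question (time, WebDriverWait, EC) are in scope; the By import is needed in addition.

from selenium.webdriver.common.by import By
from selenium.common.exceptions import WebDriverException

# Give the page a moment to settle after the last scroll, then make sure the
# body element is present before asking for the source.
time.sleep(5)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "body"))
)

# Retry page_source a few times in case the first transfer still fails.
page_source = None
for attempt in range(3):
    try:
        page_source = driver.page_source
        break
    except WebDriverException:
        time.sleep(5)

if page_source is None:
    raise RuntimeError("Could not retrieve the page source after 3 attempts")
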
