How can I scrape event links and contact information from a website using Python?


I'm trying to scrape event links and contact information from the RaceRoster website (https://raceroster.com/search?q=5k&t=upcoming) using Python, requests, Pandas, and BeautifulSoup. The goal is to extract each event's name, URL, contact name, and email address, and save the data to an Excel file so we can reach out to these events for business development purposes.

However, the script consistently reports that no event links are found on the search results page, even though the links are visible when inspecting the HTML in the browser. Here is the relevant HTML for an event link on the search results page:

<a href="https://raceroster.com/events/2025/98542/13th-annual-delaware-tech-chocolate-run-5k" 
   target="_blank" 
   rel="noopener noreferrer" 
   class="search-results__card-event-name">
    13th Annual Delaware Tech Chocolate Run 5k
</a>

Steps taken:

  1. Verified the correct selector for the event links:
soup.select("a.search-results__card-event-name")

  2. Inspected the response content of the requests.get() call with soup.prettify(). The HTML appears to be missing the event links that are visible in the browser, which suggests the content is loaded dynamically via JavaScript (a quick check for this follows the output below).

  3. Tried to scrape the data with BeautifulSoup, but consistently got:

Found 0 events on the page.
Scraped 0 events.
No contacts were scraped.
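
The quick check referenced in step 2: search the raw response text for the event-link class. If it's absent, the links must be injected client-side by JavaScript. (This is my own sanity check, not part of the main script.)

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get("https://raceroster.com/search?q=5k&t=upcoming", headers=headers)

# False means the event links are not in the server-rendered HTML,
# i.e. they are added later by JavaScript running in the browser.
print("search-results__card-event-name" in response.text)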

What I need help with:

  • How do I handle this JavaScript-loaded content? Is there a way to scrape it directly, or do I need a tool like Selenium?
  • If Selenium is needed, how do I integrate it correctly with BeautifulSoup to parse the rendered HTML? A rough sketch of what I'm picturing is below.
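
For reference, this is roughly the Selenium-to-BeautifulSoup handoff I have in mind (an untested sketch: it assumes Chrome and the selenium package are installed, and reuses the selector from above):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("--headless=new")  # run without a visible browser window
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://raceroster.com/search?q=5k&t=upcoming")
    # Wait until the JavaScript-rendered event cards are actually in the DOM.
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, "a.search-results__card-event-name")
        )
    )
    # Hand the fully rendered HTML to BeautifulSoup, same as with requests.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    event_links = soup.select("a.search-results__card-event-name")
    print(f"Found {len(event_links)} events on the page.")
finally:
    driver.quit()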

Current script:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_event_contacts(base_url, search_url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    event_contacts = []

    # Fetch the main search page
    print(f"Scraping page: {search_url}")
    response = requests.get(search_url, headers=headers)

    if response.status_code != 200:
        print(f"Failed to fetch page: {search_url}, status code: {response.status_code}")
        return event_contacts

    soup = BeautifulSoup(response.content, "html.parser")
    # Select event links
    event_links = soup.select("a.search-results__card-event-name")


    print(f"Found {len(event_links)} events on the page.")

    for link in event_links:
        event_url = link['href']
        event_name = link.text.strip()  # Extract Event Name

        try:
            print(f"Scraping event: {event_url}")
            event_response = requests.get(event_url, headers=headers)
            if event_response.status_code != 200:
                print(f"Failed to fetch event page: {event_url}, status code: {event_response.status_code}")
                continue

            event_soup = BeautifulSoup(event_response.content, "html.parser")

            # Extract contact name and email
            contact_name = event_soup.find("dd", class_="event-details__contact-list-definition")
            email = event_soup.find("a", href=lambda href: href and "mailto:" in href)

            contact_name_text = contact_name.text.strip() if contact_name else "N/A"
            email_address = email['href'].split("mailto:")[1].split("?")[0] if email else "N/A"

            if contact_name or email:
                print(f"Found contact: {contact_name_text}, email: {email_address}")
                event_contacts.append({
                    "Event Name": event_name,
                    "Event URL": event_url,
                    "Event Contact": contact_name_text,
                    "Email": email_address
                })
            else:
                print(f"No contact information found for {event_url}")
        except Exception as e:
            print(f"Error scraping event {event_url}: {e}")

    print(f"Scraped {len(event_contacts)} events.")
    return event_contacts

def save_to_spreadsheet(data, output_file):
    if not data:
        print("No data to save.")
        return
    df = pd.DataFrame(data)
    df.to_excel(output_file, index=False)
    print(f"Data saved to {output_file}")

if __name__ == "__main__":
    base_url = "https://raceroster.com"
    search_url = "https://raceroster.com/search?q=5k&t=upcoming"
    output_file = "/Users/my_name/Documents/event_contacts.xlsx"

    contact_data = scrape_event_contacts(base_url, search_url)
    if contact_data:
        save_to_spreadsheet(contact_data, output_file)
    else:
        print("No contacts were scraped.")

Expected results:

  • Extract all event links from the search results page.
  • Navigate to each event's details page.
  • Scrape the contact name (the dd.event-details__contact-list-definition element) and email (the mailto: link) from the details page.
  • Save the results to an Excel file.
Tags: python, excel, pandas, web-scraping, beautifulsoup
1 Answer

Use the API endpoint that backs the search page to fetch the upcoming-event data: it lives on the search.raceroster.com host and returns JSON, so no JavaScript rendering is needed.

Here's how:

import requests
from tabulate import tabulate

url = 'https://search.raceroster.com/search?q=5k&t=upcoming'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
events = requests.get(url, headers=headers).json()['data']

table = [
    [event["name"], event["url"]] for event in events
]

print(tabulate(table, headers=["Name", "URL"]))

This should print:

Name                                         URL
-------------------------------------------  ------------------------------------------------------------------------------------------
Credit Union Cherry Blossom                  https://raceroster.com/events/2025/72646/credit-union-cherry-blossom
Big Cork Wine Run 5k                         https://raceroster.com/events/2025/98998/big-cork-wine-run-5k
3rd Annual #OptOutside Black Friday Fun Run  https://raceroster.com/events/2025/98146/3rd-annual-number-optoutside-black-friday-fun-run
Ryan's Race 5K walk Run                      https://raceroster.com/events/2025/97852/ryans-race-5k-walk-run
13th Annual Delaware  Tech Chocolate Run 5k  https://raceroster.com/events/2025/98542/13th-annual-delaware-tech-chocolate-run-5k
Builders Dash 5k                             https://raceroster.com/events/2025/99146/builders-dash-5k
The Ivy Scholarship 5k                       https://raceroster.com/events/2025/96874/the-ivy-scholarship-5k
39th Firecracker 5k Run Walk                 https://raceroster.com/events/2025/96907/39th-firecracker-5k-run-walk
24th Annual John D Kelly Logan House 5k      https://raceroster.com/events/2025/97364/24th-annual-john-d-kelly-logan-house-5k
2nd Annual Scott Trot 5K                     https://raceroster.com/events/2025/96904/2nd-annual-scott-trot-5k
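
From there, you can feed each event URL back into the per-event contact scraping from your question and write everything to Excel. A minimal end-to-end sketch, assuming the event detail pages are server-rendered and still contain the dd.event-details__contact-list-definition element and a mailto: link as in your script:

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}

# Same API endpoint as above; the events sit under the 'data' key.
url = 'https://search.raceroster.com/search?q=5k&t=upcoming'
events = requests.get(url, headers=headers).json()['data']

rows = []
for event in events:
    page = requests.get(event['url'], headers=headers)
    event_soup = BeautifulSoup(page.content, 'html.parser')

    # Selectors taken from the question's script.
    contact = event_soup.find('dd', class_='event-details__contact-list-definition')
    email = event_soup.find('a', href=lambda h: h and h.startswith('mailto:'))

    rows.append({
        'Event Name': event['name'],
        'Event URL': event['url'],
        'Event Contact': contact.text.strip() if contact else 'N/A',
        'Email': email['href'].split('mailto:')[1].split('?')[0] if email else 'N/A',
    })

pd.DataFrame(rows).to_excel('event_contacts.xlsx', index=False)
print(f"Saved {len(rows)} events to event_contacts.xlsx")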