Can anyone help me extract whether a puppy has been "sold" based on the "This puppy is no longer available" message in its listing?


I am trying to build a dataset by extracting information about all the "French Bulldog" listings on pawrade.com.

I have started a scraping script, but I cannot get the "Status" column to work correctly. If you open the listing of a puppy that has already been sold, you will see the message "This puppy is no longer available." I am using that message to decide whether a puppy in the "Status" column is sold.

import requests
from bs4 import BeautifulSoup
import csv

# Function to scrape a single page
def scrape_page(url, csv_writer):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    spans = soup.find_all("span", class_="fave position-absolute")
    
    for span in spans:
        adid = span.get("data-adid")
        name = span.get("data-name")
        breed = span.get("data-breed")
        price = span.get("data-price")
        puppy_url = f"https://www.pawrade.com/puppies/french-bulldog/{adid}"

        # Extract additional details from the puppy's individual page
        puppy_response = requests.get(puppy_url)
        puppy_soup = BeautifulSoup(puppy_response.content, "html.parser")

        dob = extract_detail(puppy_soup, "fa-birthday-cake")
        weight = extract_detail(puppy_soup, "fa-weight")
        registration = extract_detail(puppy_soup, "fa-trophy")
        color = extract_detail(puppy_soup, "fa-palette")
        release_date = extract_detail(puppy_soup, "fa-calendar-alt")
        microchip = extract_detail(puppy_soup, "fa-microchip")

        # Determine status based on the presence of specific messages
        status = "available"
        if puppy_soup.find("h4", class_="mb-0") and "This puppy is no longer available." in puppy_soup.find("h4", class_="mb-0").text:
            status = "sold"

        csv_writer.writerow([adid, name, breed, price, puppy_url, dob, weight, registration, color, release_date, microchip, status])

# Helper function to extract details based on class
def extract_detail(soup, class_name):
    detail_element = soup.find("span", class_=class_name)
    if detail_element:
        detail_text = detail_element.find_next_sibling("div").find("small").text.strip()
        return detail_text
    return ""

# URL of the first page of listings
base_url = "https://www.pawrade.com/puppies/french-bulldog/"

# Create a CSV file
with open("french_bulldogs.csv", mode="w", newline="", encoding="utf-8") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["AdID", "Name", "Breed", "Price", "URL", "DOB", "Weight", "Registration", "Color", "Release Date", "Microchip", "Status"])

    # Scrape the first page to get the total number of pages
    response = requests.get(base_url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Example logic to determine total pages - this may need adjustment based on actual site structure
    pagination = soup.find("ul", class_="pagination")
    if pagination:
        pages = pagination.find_all("li", class_="page-item")
        total_pages = int(pages[-2].text.strip())  # Adjust the logic to find the total number of pages
    else:
        total_pages = 1  # Default to 1 if no pagination found

    # Iterate through all pages
    for page in range(1, total_pages + 1):
        url = f"{base_url}?page={page}"
        scrape_page(url, csv_writer)

print("Data has been written to french_bulldogs.csv")
Tags: python, web-scraping, beautifulsoup
1 Answer

Your code works. Just look at what you are actually running it against. You can never end up with a sold puppy, because you only run it for puppies that are currently available. You request base_url, find all span elements with the class fave position-absolute, and then run the for loop:

spans = soup.find_all("span", class_="fave position-absolute")
for span in spans:
    ...

The problem is that, if we look closely, only AVAILABLE puppies appear on that listing page (see the screenshot of the listings page).
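A quick way to convince yourself of this (just a sketch, reusing the base_url and the exact message text from the question) is to check whether the sold-out message ever appears anywhere in the listing page's HTML:

import requests
from bs4 import BeautifulSoup

# Fetch the listings page and look for the sold-out message anywhere in its text.
# If the listing really only shows available puppies, this should print False.
listing_response = requests.get("https://www.pawrade.com/puppies/french-bulldog/")
listing_soup = BeautifulSoup(listing_response.content, "html.parser")
print("This puppy is no longer available." in listing_soup.get_text())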

You can test your code by slightly modifying the scrape_page function, like this:

import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    response = requests.get(url)
    puppy_soup = BeautifulSoup(response.content, "html.parser")
    status = "available"
    is_sold_out = puppy_soup.find("h4", class_="mb-0")
    if is_sold_out and "This puppy is no longer available." in is_sold_out.text:
        status = "sold"
    print(status)


scrape_page('https://www.pawrade.com/puppies/australian-shepherd/75daa83431/')  # sold
scrape_page('https://www.pawrade.com/puppies/french-bulldog/69d02849b1/')  # available

Note that I did not change the logic for determining status; I only moved the selector lookup into a separate variable for readability.
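If you still want "sold" to show up in your dataset, one option (only a sketch, assuming you re-use the puppy URLs already written to french_bulldogs.csv by a previous run; the recheck_statuses helper below is hypothetical) is to re-visit those saved URLs later and re-evaluate the same message:

import csv
import requests
from bs4 import BeautifulSoup

def recheck_statuses(csv_path="french_bulldogs.csv"):
    # Hypothetical helper: re-open the CSV produced by the original script and
    # re-check each saved puppy URL for the "no longer available" message.
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        rows = list(csv.DictReader(csv_file))
    if not rows:
        return

    for row in rows:
        response = requests.get(row["URL"])
        soup = BeautifulSoup(response.content, "html.parser")
        heading = soup.find("h4", class_="mb-0")
        if heading and "This puppy is no longer available." in heading.text:
            row["Status"] = "sold"

    # Write the refreshed statuses back out with the same columns as before.
    with open(csv_path, mode="w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)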
