I'm trying to build a dataset of all listed French Bulldogs on pawrade.com.

I've started a scraping script, but I can't get the "status" column to work. If you open the listing of a puppy that has already been sold, you'll see the message "This puppy is no longer available." I'm using that message to decide whether a puppy should be marked as sold in the "status" column.
import requests
from bs4 import BeautifulSoup
import csv

# Function to scrape a single page of listings
def scrape_page(url, csv_writer):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    spans = soup.find_all("span", class_="fave position-absolute")
    for span in spans:
        adid = span.get("data-adid")
        name = span.get("data-name")
        breed = span.get("data-breed")
        price = span.get("data-price")
        puppy_url = f"https://www.pawrade.com/puppies/french-bulldog/{adid}"
        # Extract additional details from the puppy's individual page
        puppy_response = requests.get(puppy_url)
        puppy_soup = BeautifulSoup(puppy_response.content, "html.parser")
        dob = extract_detail(puppy_soup, "fa-birthday-cake")
        weight = extract_detail(puppy_soup, "fa-weight")
        registration = extract_detail(puppy_soup, "fa-trophy")
        color = extract_detail(puppy_soup, "fa-palette")
        release_date = extract_detail(puppy_soup, "fa-calendar-alt")
        microchip = extract_detail(puppy_soup, "fa-microchip")
        # Determine status based on the presence of the sold-out message
        status = "available"
        sold_banner = puppy_soup.find("h4", class_="mb-0")
        if sold_banner and "This puppy is no longer available." in sold_banner.text:
            status = "sold"
        csv_writer.writerow([adid, name, breed, price, puppy_url, dob, weight, registration, color, release_date, microchip, status])

# Helper function to extract a detail by its icon class
def extract_detail(soup, class_name):
    detail_element = soup.find("span", class_=class_name)
    if detail_element:
        return detail_element.find_next_sibling("div").find("small").text.strip()
    return ""

# URL of the first page of listings
base_url = "https://www.pawrade.com/puppies/french-bulldog/"

# Create a CSV file
with open("french_bulldogs.csv", mode="w", newline="", encoding="utf-8") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["AdID", "Name", "Breed", "Price", "URL", "DOB", "Weight", "Registration", "Color", "Release Date", "Microchip", "Status"])
    # Scrape the first page to get the total number of pages
    response = requests.get(base_url)
    soup = BeautifulSoup(response.content, "html.parser")
    # Example logic to determine total pages - this may need adjustment based on actual site structure
    pagination = soup.find("ul", class_="pagination")
    if pagination:
        pages = pagination.find_all("li", class_="page-item")
        total_pages = int(pages[-2].text.strip())  # Adjust the logic to find the total number of pages
    else:
        total_pages = 1  # Default to 1 if no pagination found
    # Iterate through all pages
    for page in range(1, total_pages + 1):
        url = f"{base_url}?page={page}"
        scrape_page(url, csv_writer)

print("Data has been written to french_bulldogs.csv")
Your code works. The issue is what you are running it against: you can never see a sold puppy, because the script only ever visits puppies that are currently available. You request base_url, find every span element with the class fave position-absolute, and loop over them:
spans = soup.find_all("span", class_="fave position-absolute")
for span in spans:
...
The problem is that, if you look closely at that listing page, only AVAILABLE puppies appear there, so the sold-out message can never show up on the detail pages you visit.

You can verify this by slightly modifying the scrape_page function and pointing it at two known detail pages, like so:
def scrape_page(url):
    response = requests.get(url)
    puppy_soup = BeautifulSoup(response.content, "html.parser")
    status = "available"
    is_sold_out = puppy_soup.find("h4", class_="mb-0")
    if is_sold_out and "This puppy is no longer available." in is_sold_out.text:
        status = "sold"
    print(status)

scrape_page('https://www.pawrade.com/puppies/australian-shepherd/75daa83431/')  # sold
scrape_page('https://www.pawrade.com/puppies/french-bulldog/69d02849b1/')  # available
Note that I didn't change the logic for determining status; I only moved the selector into a separate variable for readability.
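You can also check that the status logic itself is sound without hitting the network at all, by running it against small HTML snippets. This is a minimal sketch: the parse_status helper and the two snippets are mine, not part of your script, and the snippets only imitate the relevant markup, not the real pages.

```python
from bs4 import BeautifulSoup

SOLD_MESSAGE = "This puppy is no longer available."

def parse_status(html):
    """Return 'sold' or 'available' using the same check as the script:
    look for an <h4 class="mb-0"> containing the sold-out message."""
    soup = BeautifulSoup(html, "html.parser")
    banner = soup.find("h4", class_="mb-0")
    if banner and SOLD_MESSAGE in banner.text:
        return "sold"
    return "available"

# Hypothetical snippets standing in for real puppy pages
sold_html = '<h4 class="mb-0">This puppy is no longer available.</h4>'
available_html = '<h4 class="mb-0">Meet Bella!</h4>'

print(parse_status(sold_html))       # sold
print(parse_status(available_html))  # available
```

Since both snippets produce the expected result, the detection itself is fine; the only fix needed is feeding the script URLs of puppies that can actually be sold.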