I am trying to scrape all of the reviews from this site - https://www.backmarket.com/en-us/r/l/airpods/345c3c05-8a7b-4d4d-ac21-518b12a0ec17. The site says there are 753 reviews, but when I try to scrape them I only get 10. So I don't know how to get all 753 reviews from the page. Here is my code -
# importing modules
import pandas as pd
from requests import get
from bs4 import BeautifulSoup

# Fetch the web page
url = 'https://www.backmarket.com/en-us/r/l/airpods/345c3c05-8a7b-4d4d-ac21-518b12a0ec17'
response = get(url)  # link excludes posts with no pictures
page = response.text

# Parse the HTML content
soup = BeautifulSoup(page, 'html.parser')

# To see different information
## Reviewer's name
reviewers_name = soup.find_all('p', class_='body-1-bold')
name = [item.text for item in reviewers_name]

## Purchase date
purchase_date = soup.find_all('p', class_='text-static-default-low body-2')
date = [item.text for item in purchase_date]

## Country
country_text = soup.find_all('p', class_='text-static-default-low body-2 mt-32')
country = [item.text for item in country_text]

## Reviewed products
products_text = soup.find_all('span', class_='rounded-xs inline-block max-w-full truncate body-2-bold px-4 py-0 bg-static-default-mid text-static-default-hi')
products = [item.text for item in products_text]

## Actual reviews
review_text = soup.find_all('p', class_='body-1 block whitespace-pre-line')
review = [item.text for item in review_text]

## Review ratings
review_ratings_value = soup.find_all('span', class_='ml-4 mt-1 md:mt-2 body-2-bold')
review_ratings = [item.text for item in review_ratings_value]

# Create the DataFrame
pd.DataFrame({
    'reviewers_name': name,
    'purchase_date': date,
    'country': country,
    'products': products,
    'review': review,
    'review_ratings': review_ratings
})
My question is: how do I scrape all of the reviews?
You can try the following (note: the site has "too many requests" protection, so when you receive HTTP status code 429 you have to wait a while before continuing):
import time

import requests

url = "https://www.backmarket.com/reviews/product-landings/345c3c05-8a7b-4d4d-ac21-518b12a0ec17/products/reviews"

n, current_url = 1, url
while True:
    response = requests.get(current_url)

    # too many requests? wait and retry the same URL
    if response.status_code == 429:
        print("Too many requests...")
        time.sleep(2)
        continue

    data = response.json()
    for r in data.get("results", []):
        print(n, "-" * 80)
        print(r["comment"])
        n += 1

    # follow the cursor to the next page, if any
    next_cursor = data.get("nextCursor")
    if not next_cursor:
        break

    current_url = f"{url}?cursor={next_cursor}"
Prints:
...
748 --------------------------------------------------------------------------------
Ordered space gray but received silver
749 --------------------------------------------------------------------------------
Lately, every item we’ve ordered has had to be returned. I’m pretty bummed.
750 --------------------------------------------------------------------------------
They broke.
751 --------------------------------------------------------------------------------
They’re weren’t noise cancelling like it says in the description so it wasn’t what I wanted but other than that they are great.
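As a follow-up, the cursor-following loop above can be factored into a reusable generator, which also makes it easy to collect the results into a pandas DataFrame as in your original code. This is only a sketch of the same pagination pattern; `paginate` and the `fetch` callback are illustrative names I made up, not part of Back Market's API:

```python
from typing import Callable, Iterator, Optional


def paginate(fetch: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every result from a cursor-paginated endpoint.

    `fetch` is called with the current cursor (None for the first page)
    and must return the decoded JSON page: a dict with a "results" list
    and, while more pages remain, a "nextCursor" value.
    """
    cursor = None
    while True:
        page = fetch(cursor)
        # emit each result on this page, then advance the cursor
        yield from page.get("results", [])
        cursor = page.get("nextCursor")
        if not cursor:
            break
```

To use it against the real endpoint, you would pass a `fetch` function that appends `?cursor=...` to the URL, calls `requests.get`, and sleeps and retries on status 429 exactly as in the answer's loop; `pd.DataFrame(list(paginate(fetch)))` then turns the accumulated JSON objects into one table.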