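I set up a simple Python script to scrape the name and image of each item in the H&M men's section. The names come back fine, but only the first few image URLs are real; after that, every one comes back in the following form: "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7". I have tried requests and Selenium (with chromedriver) separately. What am I missing?
First attempt (requests):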
import requests
from bs4 import BeautifulSoup
# URL of the H&M men's section
url = "https://www2.hm.com/en_us/men/products/view-all.html?page=1"
# Headers to mimic a browser visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Referer": "https://www.google.com/",
    "Connection": "keep-alive",
}
# Send a GET request to the webpage
response = requests.get(url, headers=headers)
# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')
    # Find all the product items
    items = soup.find_all('article', class_='f0cf84')
    # Iterate over the items and extract the name and image URL
    for item in items:
        # Extract the product name
        name = item.find('a', class_='db7c79')['title']
        # Extract the image URL (the 'src' attribute of the <img> tag)
        img_tag = item.find('img', imagetype='PRODUCT_IMAGE')
        img_url = img_tag['src'] if img_tag else 'No image'
        # Print the name and image URL
        print(f"Product Name: {name}")
        print(f"Image URL: {img_url}\n")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
Second attempt (Selenium):
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
# URL of the H&M men's section
url = "https://www2.hm.com/en_us/men/products/view-all.html?page=1"
# Open the webpage
driver.get(url)
# Get the page source and parse it with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Find all the product items
items = soup.find_all('article', class_='f0cf84')
# Iterate over the items and extract the name and image URL
for item in items:
    # Extract the product name
    name = item.find('a', class_='db7c79')['title']
    # Extract the image URL (the 'src' attribute of the <img> tag)
    img_tag = item.find('img', imagetype='PRODUCT_IMAGE')
    img_url = img_tag['src'] if img_tag else 'No image'
    # Print the name and image URL
    print(f"Product Name: {name}")
    print(f"Image URL: {img_url}\n")
# Quit the WebDriver
driver.quit()
The output is the same both times:
Product Name: Baggy Jeans
Image URL: https://image.hm.com/assets/hm/9e/53/9e53035efef96606bc4b50eaf6a0eee4f08a152c.jpg?imwidth=1536
Product Name: Regular Fit Cotton Shorts
Image URL: https://image.hm.com/assets/hm/8f/d8/8fd8d52f2e2c778041410f9a2727b448053ca8b7.jpg?imwidth=1536
Product Name: Regular Fit Linen-blend Shorts
Image URL: https://image.hm.com/assets/hm/d7/54/d7546a095c04387d1ad98575588c84e0426fb4be.jpg?imwidth=1536
Product Name: Muscle Fit Cotton Shirt
Image URL: https://image.hm.com/assets/hm/c7/d4/c7d49cef60f9d196d2f5347815f416bba7d4b636.jpg?imwidth=1536
Product Name: Slim Fit Ribbed Tank Top
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Slim Fit Jacket
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: 5-pack Slim Fit T-shirts
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Regular Fit Linen-blend Resort Shirt
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Slim Fit Suit Pants
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Regular Fit Cotton Shorts
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Slim Fit Suit Pants
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Slim Fit Polo Shirt
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Baggy Jeans
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Slim Fit Half-zip Polo Shirt
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Slim Fit Linen Jacket
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Loose Fit Cargo Jeans
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Regular Fit Nylon Cargo Shorts
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Loose Fit T-shirt
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Loose Jeans
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Swim Shorts
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Regular Fit Chino Shorts
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Muscle Fit Polo Shirt
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Regular Fit Linen-blend Pants
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: 5-pack Short Cotton Boxer Shorts
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Regular Fit Cropped Cotton Chinos
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Regular Fit Linen-blend Shirt
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Slim Fit Linen Suit Pants
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Regular Fit Linen-blend Shorts
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Swim Shorts
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Patterned Swim Shorts
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Patterned Swim Shorts
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Regular Fit T-shirt
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Regular Fit T-shirt
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Regular Fit Sweatshorts
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Regular Fit Cotton Shorts
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Product Name: Slim Fit T-shirt
Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
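These values are image data embedded in base64 format (here, a GIF), not URLs pointing to an external resource. You can decode the base64 directly and save the result in its original format.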
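To make the decoding step concrete, here is a minimal sketch using only the standard library. The helper names (`decode_data_uri`, `gif_size`, `save_image`) are made up for illustration and are not part of requests, Selenium, or BeautifulSoup; the `SRC` value is copied from the output in the question.

```python
import base64
import struct

# Value copied from the output in the question.
SRC = "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"


def decode_data_uri(uri):
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    # Re-add any '=' padding that may have been stripped from the payload.
    payload += "=" * (-len(payload) % 4)
    return mime_type, base64.b64decode(payload)


def gif_size(data):
    """Return (width, height) read from the GIF header."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF")
    # Bytes 6-9 hold width and height as little-endian unsigned 16-bit ints.
    return struct.unpack("<HH", data[6:10])


def save_image(data, path):
    """Write the decoded bytes to disk unchanged (hypothetical helper)."""
    with open(path, "wb") as fh:
        fh.write(data)


mime_type, raw = decode_data_uri(SRC)
width, height = gif_size(raw)
print(f"{mime_type}: {len(raw)} bytes, {width}x{height} px")
```

Calling `save_image(raw, "image.gif")` (the file name is arbitrary) writes the decoded bytes as a GIF file. It is worth checking the size that `gif_size` reports before saving: if it is only a pixel or so, the value is probably a placeholder rather than the product photo, and the real image URL would have to come from somewhere else in the page.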