Cannot get image URLs when scraping a website; only data:image/gif;base64 is returned

Problem description

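I set up a simple Python script to scrape the name and image of every product listed in the H&M men's section. The names come back fine, but real image URLs are returned only for the first few products; after that I only get values of the form "data:image/gif;base64". I have tried both requests and Selenium with chromedriver, separately. What am I missing?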

First attempt (requests):

import requests
from bs4 import BeautifulSoup

# URL of the H&M men's section
url = "https://www2.hm.com/en_us/men/products/view-all.html?page=1"

# Headers to mimic a browser visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Referer": "https://www.google.com/",
    "Connection": "keep-alive"
}

# Send a GET request to the webpage
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all the product items
    items = soup.find_all('article', class_='f0cf84')

    # Iterate over the items and extract the name and image URL
    for item in items:
        # Extract the product name
        name = item.find('a', class_='db7c79')['title']
        
        # Extract the image URL (the 'src' attribute of the <img> tag)
        img_tag = item.find('img', imagetype='PRODUCT_IMAGE')
        img_url = img_tag['src'] if img_tag else 'No image'

        # Print the name and image URL
        print(f"Product Name: {name}")
        print(f"Image URL: {img_url}\n")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

Second attempt (Selenium):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()

# URL of the H&M men's section
url = "https://www2.hm.com/en_us/men/products/view-all.html?page=1"

# Open the webpage
driver.get(url)

# Get the page source and parse it with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Find all the product items
items = soup.find_all('article', class_='f0cf84')

# Iterate over the items and extract the name and image URL
for item in items:
    # Extract the product name
    name = item.find('a', class_='db7c79')['title']

    # Extract the image URL (the 'src' attribute of the <img> tag)
    img_tag = item.find('img', imagetype='PRODUCT_IMAGE')
    img_url = img_tag['src'] if img_tag else 'No image'

    # Print the name and image URL
    print(f"Product Name: {name}")
    print(f"Image URL: {img_url}\n")

# Quit the WebDriver
driver.quit()

The output is the same in both cases:

Product Name: Baggy Jeans
Image URL: https://image.hm.com/assets/hm/9e/53/9e53035efef96606bc4b50eaf6a0eee4f08a152c.jpg?imwidth=1536

Product Name: Regular Fit Cotton Shorts
Image URL: https://image.hm.com/assets/hm/8f/d8/8fd8d52f2e2c778041410f9a2727b448053ca8b7.jpg?imwidth=1536

Product Name: Regular Fit Linen-blend Shorts
Image URL: https://image.hm.com/assets/hm/d7/54/d7546a095c04387d1ad98575588c84e0426fb4be.jpg?imwidth=1536

Product Name: Muscle Fit Cotton Shirt
Image URL: https://image.hm.com/assets/hm/c7/d4/c7d49cef60f9d196d2f5347815f416bba7d4b636.jpg?imwidth=1536

Product Name: Slim Fit Ribbed Tank Top
Image URL: 

Product Name: Slim Fit Jacket
Image URL: 

Product Name: 5-pack Slim Fit T-shirts
Image URL: 

Product Name: Regular Fit Linen-blend Resort Shirt
Image URL: 

Product Name: Slim Fit Suit Pants
Image URL: 

Product Name: Regular Fit Cotton Shorts
Image URL: 

Product Name: Slim Fit Suit Pants
Image URL: 

Product Name: Slim Fit Polo Shirt
Image URL: 

Product Name: Baggy Jeans
Image URL: 

Product Name: Slim Fit Half-zip Polo Shirt
Image URL: 

Product Name: Slim Fit Linen Jacket
Image URL: 

Product Name: Loose Fit Cargo Jeans
Image URL: 

Product Name: Regular Fit Nylon Cargo Shorts
Image URL: 

Product Name: Loose Fit T-shirt
Image URL: 

Product Name: Loose Jeans
Image URL: 

Product Name: Swim Shorts
Image URL: 

Product Name: Regular Fit Chino Shorts
Image URL: 

Product Name: Muscle Fit Polo Shirt
Image URL: 

Product Name: Regular Fit Linen-blend Pants
Image URL: 

Product Name: 5-pack Short Cotton Boxer Shorts
Image URL: 

Product Name: Regular Fit Cropped Cotton Chinos
Image URL: 

Product Name: Regular Fit Linen-blend Shirt
Image URL: 

Product Name: Slim Fit Linen Suit Pants
Image URL: 

Product Name: Regular Fit Linen-blend Shorts
Image URL: 

Product Name: Swim Shorts
Image URL: 

Product Name: Patterned Swim Shorts
Image URL: 

Product Name: Patterned Swim Shorts
Image URL: 

Product Name: Regular Fit T-shirt
Image URL: 

Product Name: Regular Fit T-shirt
Image URL: 

Product Name: Regular Fit Sweatshorts
Image URL: 

Product Name: Regular Fit Cotton Shorts
Image URL: 

Product Name: Slim Fit T-shirt
Image URL: 
python selenium-webdriver web-scraping beautifulsoup selenium-chromedriver
1 Answer

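Those values are image data embedded in base64 format; they contain no URL pointing to an external resource. You can decode the base64 directly and save the result in its original format.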
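As a rough sketch of the decoding step described above, the following uses only the Python standard library. The helper names `decode_data_uri` and `save_data_uri` and the sample value are hypothetical and not part of the original scripts; the sample is the widely used 1x1 transparent GIF, not data taken from the H&M page.

```python
import base64
from pathlib import Path
from urllib.parse import unquote_to_bytes


def decode_data_uri(data_uri):
    """Split a data: URI into (mime_type, raw_bytes). Hypothetical helper."""
    if not data_uri.startswith("data:"):
        raise ValueError("not a data: URI")
    # Everything before the first comma is the header, the rest is the payload.
    header, _, payload = data_uri[len("data:"):].partition(",")
    parts = header.split(";")
    mime_type = parts[0] or "text/plain"
    if "base64" in parts[1:]:
        raw = base64.b64decode(payload)
    else:
        # Payloads without the base64 flag are percent-encoded.
        raw = unquote_to_bytes(payload)
    return mime_type, raw


def save_data_uri(data_uri, stem):
    """Decode a data: URI and write it to '<stem>.<subtype>'. Hypothetical helper."""
    mime_type, raw = decode_data_uri(data_uri)
    extension = mime_type.split("/")[-1]  # "image/gif" -> "gif"
    path = Path(f"{stem}.{extension}")
    path.write_bytes(raw)
    return path


# Sample value (assumption): the common 1x1 transparent GIF.
sample = (
    "data:image/gif;base64,"
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)

if __name__ == "__main__":
    mime_type, raw = decode_data_uri(sample)
    print(mime_type, len(raw), "bytes")
```

In the scripts from the question, one could call `save_data_uri(img_url, name)` whenever `img_url.startswith("data:")`. If the decoded payload is only a few dozen bytes, it may be a placeholder rather than the product photo; that is an assumption worth checking on the actual page, not something stated in the answer.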
