Problem scraping image URLs from a web page


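I'm trying to scrape image URLs from a web page using BeautifulSoup and requests in Python. The page I'm targeting is the Minecraft wiki page about blocks. However, I get "IndentationError: unexpected indent" on line 35, which is the line for img_tag in soup.find_all('img', class_='mw-file-element'):. I'm not sure what is causing it. Can anyone help me identify and fix the problem?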

from bs4 import BeautifulSoup
import requests
import re

# URL of the website containing the block names and images
url = 'https://minecraft.wiki/w/Block'

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content of the page
soup = BeautifulSoup(response.text, 'html.parser')

# Find all elements containing block names and images
blocks = []
for li in soup.find_all('li'):
    name_element = li.find('a', class_='mw-redirect')
    
    if name_element:
        name = name_element.text.strip()
        url = 'https://minecraft.wiki/w/' + re.sub(r'\s+', '_', name)
    
        blocks.append({'name': name, 'url': url})

for block in blocks:
    print(block['name'])
    print(block['url'])

images = []
for block in blocks:
    block_url = block['url']
    response = requests.get(block_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
        for i in soup.find_all('img'):
                image_element = i.find('img', class_='mw-file-element')
                if image_element:
                    srcset = image_element.get('srcset', '')
                    original_image_url = re.search(r'^(.+\.png)', srcset)

                    if original_image_url:
                        image_url = 'https://minecraft.wiki' + original_image_url.group(1)
                        images.append({'image:' : image_url})

for image in images:
     print(image['image'])
python screen-scraping
1 Answer

Try something like this:

from bs4 import BeautifulSoup
import requests
import re
import time

url = 'https://minecraft.wiki/w/Block'

response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

blocks = []
for li in soup.find_all('li'):
    name_element = li.find('a', class_='mw-redirect')
    
    if name_element:
        name = name_element.text.strip()
        url = 'https://minecraft.wiki/w/' + re.sub(r'\s+', '_', name)
    
        blocks.append({'name': name, 'url': url})

images = []

for block in blocks:
    response = requests.get(block['url'])
    soup = BeautifulSoup(response.text, 'html.parser')

    for img in soup.select("img.mw-file-element"):
        # Get src attribute from <img> tag; skip tags that have none.
        src = img.get("src")
        if not src:
            continue
        # Strip off any query parameters (raw string avoids an invalid escape).
        src = re.sub(r"\?.*$", "", src)
        image_url = "https://minecraft.wiki" + src
        print(image_url)
        images.append({'image': image_url})

    # Pause.
    time.sleep(10)

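I've corrected the syntax error in the second loop. I also looked at how you were extracting the image URLs and made a few suggested changes. I hope I've understood correctly what you're trying to achieve.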
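
To illustrate the kind of syntax error involved: Python reports "unexpected indent" when a line is indented more deeply than the statement before it and that statement does not open a block (it does not end in a colon). The following is only a minimal sketch that reproduces this with the built-in compile(). The two snippet strings and the helper name indentation_problem are made up for illustration, and the names inside the snippets (blocks, fetch) are never executed.

```python
# Hypothetical snippets: the inner "for" in BROKEN is indented one level too
# far, because the line before it is a plain assignment, not a block opener.
BROKEN = (
    "for block in blocks:\n"
    "    soup = fetch(block)\n"
    "        for img in soup:\n"
    "            print(img)\n"
)

# Same code with the inner loop aligned with the assignment above it.
FIXED = (
    "for block in blocks:\n"
    "    soup = fetch(block)\n"
    "    for img in soup:\n"
    "        print(img)\n"
)


def indentation_problem(source):
    """Return the IndentationError message for source, or None if it compiles."""
    try:
        # compile() only parses the code; nothing in the snippet is run.
        compile(source, "<snippet>", "exec")
    except IndentationError as exc:
        return exc.msg
    return None


print(indentation_problem(BROKEN))
print(indentation_problem(FIXED))
```

The first call returns an error message and the second returns None, since aligning the inner loop with the statement above it is enough for the code to parse.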

I added a pause in the second loop just to reduce the traffic to the site.
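
If it helps, the pause can be moved into a small helper so that the delay falls between requests and not after the last one. This is a sketch only: the helper name throttled and the short delay in the demo are assumptions chosen for illustration, and the helper itself does no network access.

```python
import time


def throttled(items, delay_seconds):
    """Yield items one at a time, sleeping delay_seconds between consecutive items."""
    first = True
    for item in items:
        if not first:
            # Wait before handing out every item except the first.
            time.sleep(delay_seconds)
        first = False
        yield item


# Demo with placeholder values and a short delay so it finishes quickly.
start = time.monotonic()
seen = list(throttled(["a", "b", "c"], 0.05))
elapsed = time.monotonic() - start
print(seen, round(elapsed, 2))
```

In the scraping loop this could be used as for block in throttled(blocks, 10): with the time.sleep(10) call at the end of the loop body removed.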

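I'd suggest you consider using a different data structure (for example, a set) to store the images, because there are likely to be some duplicates.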
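
A rough sketch of that suggestion follows. A set drops duplicates but does not keep the order in which URLs were found, while dict.fromkeys drops duplicates and keeps first-seen order. The variable names are made up for illustration, and the list deliberately repeats one URL.

```python
# Hypothetical list of scraped URLs; the first entry appears twice.
found = [
    "https://minecraft.wiki/images/thumb/Oak_Button_%28S%29_JE4.png/150px-Oak_Button_%28S%29_JE4.png",
    "https://minecraft.wiki/images/thumb/Spruce_Button_JE4_BE2.png/150px-Spruce_Button_JE4_BE2.png",
    "https://minecraft.wiki/images/thumb/Oak_Button_%28S%29_JE4.png/150px-Oak_Button_%28S%29_JE4.png",
]

# Option 1: a set removes duplicates; iteration order is not guaranteed.
unique_unordered = set(found)

# Option 2: dict keys are unique and keep insertion order (Python 3.7+).
unique_ordered = list(dict.fromkeys(found))

print(len(found), len(unique_unordered), len(unique_ordered))
```

Inside the scraping loop, the equivalent would be to start with images = set() and call images.add(...) with the URL string instead of appending a dictionary.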

Sample output looks like this:

https://minecraft.wiki/images/thumb/Oak_Button_%28S%29_JE4.png/150px-Oak_Button_%28S%29_JE4.png
https://minecraft.wiki/images/thumb/Spruce_Button_JE4_BE2.png/150px-Spruce_Button_JE4_BE2.png
https://minecraft.wiki/images/thumb/Birch_Button_JE3_BE2.png/150px-Birch_Button_JE3_BE2.png
https://minecraft.wiki/images/thumb/Jungle_Button_JE3_BE2.png/150px-Jungle_Button_JE3_BE2.png
https://minecraft.wiki/images/thumb/Acacia_Button_JE3_BE2.png/150px-Acacia_Button_JE3_BE2.png
https://minecraft.wiki/images/thumb/Dark_Oak_Button_JE3_BE2.png/150px-Dark_Oak_Button_JE3_BE2.png
https://minecraft.wiki/images/thumb/Mangrove_Button_JE1_BE1.png/150px-Mangrove_Button_JE1_BE1.png
https://minecraft.wiki/images/thumb/Cherry_Button_JE2.png/150px-Cherry_Button_JE2.png
https://minecraft.wiki/images/thumb/Bamboo_Button_JE3.png/150px-Bamboo_Button_JE3.png
https://minecraft.wiki/images/thumb/Crimson_Button_JE1_BE1.png/150px-Crimson_Button_JE1_BE1.png
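
Note that these URLs point to 150px thumbnails, not the full-size files. The sketch below assumes the usual MediaWiki thumbnail layout, where a thumbnail at /images/thumb/FILE/WIDTHpx-FILE is generated from an original at /images/FILE. That layout is an assumption that has not been checked against minecraft.wiki, and the function name original_from_thumb is hypothetical. URLs that do not match the pattern are returned unchanged.

```python
import re

# Matches ".../images/thumb/<file>/<width>px-<anything>" at the end of a URL.
THUMB_PATTERN = re.compile(r"/images/thumb/(?P<file>[^/]+)/\d+px-[^/]+$")


def original_from_thumb(url):
    """Rewrite a thumbnail URL to the assumed original-file URL."""
    # A function is used as the replacement so that characters in the file
    # name are inserted literally and never treated as regex escapes.
    return THUMB_PATTERN.sub(lambda m: "/images/" + m.group("file"), url)


thumb = "https://minecraft.wiki/images/thumb/Cherry_Button_JE2.png/150px-Cherry_Button_JE2.png"
print(original_from_thumb(thumb))
```

It would be worth requesting one rewritten URL by hand before relying on this for the whole list.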