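I'm trying to scrape image URLs from a web page using BeautifulSoup and requests in Python. The page I'm targeting is the Minecraft wiki page about blocks. However, I get "IndentationError: unexpected indent" on line 35, which is the line for img_tag in soup.find_all('img', class_='mw-file-element'):. I'm not sure what is causing it. Can anyone help me find and fix the problem?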
from bs4 import BeautifulSoup
import requests
import re
# URL of the website containing the block names and images
url = 'https://minecraft.wiki/w/Block'
# Send a GET request to the URL
response = requests.get(url)
# Parse the HTML content of the page
soup = BeautifulSoup(response.text, 'html.parser')
# Find all elements containing block names and images
blocks = []
for li in soup.find_all('li'):
    name_element = li.find('a', class_='mw-redirect')
    if name_element:
        name = name_element.text.strip()
        url = 'https://minecraft.wiki/w/' + re.sub(r'\s+', '_', name)
        blocks.append({'name': name, 'url': url})
for block in blocks:
    print(block['name'])
    print(block['url'])
images = []
for block in blocks:
    block_url = block['url']
    response = requests.get(block_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for i in soup.find_all('img'):
        image_element = i.find('img', class_='mw-file-element')
        if image_element:
            srcset = image_element.get('srcset', '')
            original_image_url = re.search(r'^(.+\.png)', srcset)
            if original_image_url:
                image_url = 'https://minecraft.wiki' + original_image_url.group(1)
                images.append({'image:' : image_url})
for image in images:
    print(image['image'])
Try something like this:
from bs4 import BeautifulSoup
import requests
import re
import time

base_url = 'https://minecraft.wiki'
response = requests.get(base_url + '/w/Block')
soup = BeautifulSoup(response.text, 'html.parser')

# Collect the name and page URL of every block linked from a list item.
blocks = []
for li in soup.find_all('li'):
    name_element = li.find('a', class_='mw-redirect')
    if name_element:
        name = name_element.text.strip()
        block_url = base_url + '/w/' + re.sub(r'\s+', '_', name)
        blocks.append({'name': name, 'url': block_url})

images = []
for block in blocks:
    response = requests.get(block['url'])
    soup = BeautifulSoup(response.text, 'html.parser')
    for img in soup.select("img.mw-file-element"):
        # Get src attribute from <img> tag; skip tags without one.
        src = img.get("src")
        if not src:
            continue
        # Strip off any parameters.
        src = re.sub(r"\?.*$", "", src)
        image_url = base_url + src
        print(image_url)
        images.append({'image': image_url})
    # Pause.
    time.sleep(10)
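I've fixed the syntax error in the second loop. I also looked at how you were extracting the image URLs and made some suggested changes. I hope I've understood correctly what you're trying to achieve.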
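In case it helps to see what triggers that error in general: Python raises "unexpected indent" when a line is indented more deeply than the statement before it allows. Here is a minimal sketch using the built-in compile() to check two made-up snippets; the helper name indentation_error_message and both snippets are my own illustrations, not taken from your script.

```python
def indentation_error_message(source):
    """Return the IndentationError message for source, or None if it compiles."""
    try:
        compile(source, "<snippet>", "exec")
    except IndentationError as exc:
        return exc.msg
    return None


# The "for" line is indented although no block was opened before it.
bad = "images = []\n    for image in images:\n        print(image)\n"

# Same statements, with the loop back at the top level.
good = "images = []\nfor image in images:\n    print(image)\n"

print(indentation_error_message(bad))
print(indentation_error_message(good))
```

Running your own file through python -m py_compile reports the same kind of error along with the offending line number.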
I added a pause to the second loop, just to reduce the traffic on the site.
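If you want to reuse the pause elsewhere, one option is to wrap it in a small helper. This is only a sketch: fetch_politely, its parameters, and the example.org URLs are hypothetical names I made up, and unlike the loop above it sleeps only between requests rather than after every one. The demo uses a stand-in fetch function so it runs without network access.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=10.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results


# Stand-in fetch function and placeholder URLs, for illustration only.
pages = fetch_politely(
    ["https://example.org/page1", "https://example.org/page2"],
    fetch=lambda url: "<html>" + url + "</html>",
    delay_seconds=0.1,
)
print(len(pages))
```

With requests, the fetch argument could be something like lambda url: requests.get(url, timeout=30).text.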
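I'd suggest considering a different data structure (a set, for example) to store the images, since there are likely to be some duplicates.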
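A rough sketch of the set idea, using only the standard library. It swaps the regex for urllib.parse to drop the query string, and keeps a list next to the set so the first-seen order is preserved. The function names and the query strings in example_srcs are assumptions for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = "https://minecraft.wiki"


def normalize_image_url(src, base=BASE_URL):
    """Make src absolute and drop any query string or fragment."""
    parts = urlsplit(urljoin(base, src))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def unique_image_urls(srcs):
    """Return normalized URLs in first-seen order, without duplicates."""
    seen = set()
    unique = []
    for src in srcs:
        url = normalize_image_url(src)
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


# Hypothetical src values; the "?abc12" style parameters are made up.
example_srcs = [
    "/images/thumb/Oak_Button_%28S%29_JE4.png/150px-Oak_Button_%28S%29_JE4.png?abc12",
    "/images/thumb/Spruce_Button_JE4_BE2.png/150px-Spruce_Button_JE4_BE2.png",
    "/images/thumb/Oak_Button_%28S%29_JE4.png/150px-Oak_Button_%28S%29_JE4.png?def34",
]
for url in unique_image_urls(example_srcs):
    print(url)
```

If the order doesn't matter to you, a plain set with images.add(url) in place of the list append is enough.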
Sample output from the script looks like this:
https://minecraft.wiki/images/thumb/Oak_Button_%28S%29_JE4.png/150px-Oak_Button_%28S%29_JE4.png
https://minecraft.wiki/images/thumb/Spruce_Button_JE4_BE2.png/150px-Spruce_Button_JE4_BE2.png
https://minecraft.wiki/images/thumb/Birch_Button_JE3_BE2.png/150px-Birch_Button_JE3_BE2.png
https://minecraft.wiki/images/thumb/Jungle_Button_JE3_BE2.png/150px-Jungle_Button_JE3_BE2.png
https://minecraft.wiki/images/thumb/Acacia_Button_JE3_BE2.png/150px-Acacia_Button_JE3_BE2.png
https://minecraft.wiki/images/thumb/Dark_Oak_Button_JE3_BE2.png/150px-Dark_Oak_Button_JE3_BE2.png
https://minecraft.wiki/images/thumb/Mangrove_Button_JE1_BE1.png/150px-Mangrove_Button_JE1_BE1.png
https://minecraft.wiki/images/thumb/Cherry_Button_JE2.png/150px-Cherry_Button_JE2.png
https://minecraft.wiki/images/thumb/Bamboo_Button_JE3.png/150px-Bamboo_Button_JE3.png
https://minecraft.wiki/images/thumb/Crimson_Button_JE1_BE1.png/150px-Crimson_Button_JE1_BE1.png