re.findall on requests output does not match, but works on the same HTML copied and pasted (generated from requests .text)


I am trying to capture some elements from the HTML of a URL. When I copy and paste the HTML content directly into my Python code, it works fine.

import re

# Sample HTML content
html_content = """
<<<HTML Code>>>
"""

# Regex pattern
pattern = r'{"order":\d+,"url":"(https:[^"]+\.webp)"}'

# Find matches
matches = re.findall(pattern, html_content)

# Print matches
for match in matches:
    print(match)

The code above works fine. But when I try to do the same thing directly with requests.get, it finds no matches:

import re
import requests
url = "https://asuracomic.net/series/bloodhounds-regression-instinct-2d0edc16/chapter/59"
response = requests.get(url)
html_content = response.text

# Regex pattern
pattern = r'{"order":\d+,"url":"(https:[^"]+\.webp)"}'

# Find matches
matches = re.findall(pattern, html_content)

# Print matches
for match in matches:
    print(match)

Keep in mind that the HTML I copied and pasted was itself produced by requests.get, saved like this:

with open('raw_html.html', 'w', encoding='utf-8') as f:
    f.write(html_content)
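
When two strings look identical but behave differently, printing the `repr()` of the text around the expected match can expose hidden differences. The sketch below is only an illustration: `context` is a hypothetical helper, and `fetched` and `pasted` are made-up stand-ins (with an example.com URL) for the downloaded text and the pasted literal, not the real page content.

```python
def context(text, needle="order", width=20):
    """Return the repr of the text around the first occurrence of needle."""
    i = text.find(needle)
    if i == -1:
        return None
    return repr(text[max(0, i - width): i + len(needle) + width])


# Hypothetical stand-ins (assumptions, not the real page):
# the raw string keeps its backslashes, as downloaded text would.
fetched = r'x{\"order\":1,\"url\":\"https://example.com/a.webp\"}'
# The same characters typed into a regular literal: Python reads \" as ".
pasted = 'x{\"order\":1,\"url\":\"https://example.com/a.webp\"}'

print(context(fetched))   # backslashes are visible in the repr
print(context(pasted))    # no backslashes
print(fetched == pasted)  # the two strings are not equal
```

Running the same helper on `response.text` and on the pasted string should show whether they really contain the same characters.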

Tags: python web-scraping python-requests python-re

1 answer

I managed to solve the problem by adding:

no_bs = html_content.replace('\\"', '"')

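This strips the backslashes that escape the double quotes in the response text, which contains \" where the pattern expects a plain ". Those backslashes were absent from the manually copied and pasted HTML, probably because Python treats \" inside a regular (non-raw) string literal as a plain quote. The final code looks like this: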

import re
import requests
url = "https://asuracomic.net/series/bloodhounds-regression-instinct-2d0edc16/chapter/59"
response = requests.get(url)
html_content = response.text
no_bs = html_content.replace('\\"', '"')
# Regex pattern
pattern = r'{"order":\d+,"url":"(https:[^"]+\.webp)"}'

# Find matches
matches = re.findall(pattern, no_bs)

# Print matches
for match in matches:
    print(match)
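
As an alternative to rewriting the whole response text, the pattern itself can be made to accept an optional backslash before each quote. This is a minimal sketch, not the answer's method: `find_webp_urls` is a hypothetical helper, the sample strings use a made-up example.com URL, and it assumes the URLs themselves contain no backslashes.

```python
import re

# \\? allows an optional backslash before each double quote, so the
# pattern matches both escaped (\") and plain (") forms.
PATTERN = re.compile(
    r'\{\\?"order\\?":\d+,\\?"url\\?":\\?"(https:[^"\\]+\.webp)\\?"\}'
)


def find_webp_urls(text):
    """Return every .webp URL found in an order/url object in text."""
    return PATTERN.findall(text)


# Hypothetical samples: one with escaped quotes, one without.
escaped = r'{\"order\":1,\"url\":\"https://example.com/page-1.webp\"}'
plain = '{"order":2,"url":"https://example.com/page-2.webp"}'

print(find_webp_urls(escaped))
print(find_webp_urls(plain))
```

With this approach `find_webp_urls(response.text)` could be called directly, without the intermediate `replace` step.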