I am trying to capture some elements from the HTML of a URL. When I copy the HTML content and paste it directly into my Python code, it works fine.
import re
# Sample HTML content
html_content = """
<<<HTML Code>>>
"""
# Regex pattern
pattern = r'{"order":\d+,"url":"(https:[^"]+\.webp)"}'
# Find matches
matches = re.findall(pattern, html_content)
# Print matches
for match in matches:
    print(match)
^^ This works fine. But when I try to do the same thing directly with requests.get, it does not work (no matches are found):
import re
import requests
url = "https://asuracomic.net/series/bloodhounds-regression-instinct-2d0edc16/chapter/59"
response = requests.get(url)
html_content = response.text
# Regex pattern
pattern = r'{"order":\d+,"url":"(https:[^"]+\.webp)"}'
# Find matches
matches = re.findall(pattern, html_content)
# Print matches
for match in matches:
    print(match)
Keep in mind that the HTML I copied and pasted was itself produced by requests.get and saved like this:
with open('raw_html.html', 'w', encoding='utf-8') as f:
    f.write(html_content)
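One way to check whether the downloaded text really matches what was pasted is to print the repr of a few characters around an expected match. The sketch below is only illustrative: the `context` helper and the `sample` string are hypothetical, and the sample merely imitates escaped JSON of the kind that can appear inside a script tag, not the actual page content.

```python
def context(text: str, needle: str, width: int = 10) -> str:
    """Return `needle` plus `width` characters on either side, or '' if absent."""
    i = text.find(needle)
    if i == -1:
        return ""
    return text[max(0, i - width): i + len(needle) + width]

# Hypothetical stand-in for response.text; note the backslash before each quote.
sample = r'push("{\"order\":1,\"url\":\"https://example.com/a.webp\"}")'

snippet = context(sample, "order", 3)
# repr() makes any backslashes in the text visible instead of hiding them.
print(repr(snippet))
```

If the printed snippet shows a backslash in front of each double quote, the pattern `{"order":...` cannot match it as written.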
I managed to solve the problem by adding:
no_bs = html_content.replace('\\"', '"')
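This strips the backslashes in front of the double quotes, which apparently were not carried over when I copied and pasted the HTML by hand. The final code looks like this: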
import re
import requests
url = "https://asuracomic.net/series/bloodhounds-regression-instinct-2d0edc16/chapter/59"
response = requests.get(url)
html_content = response.text
no_bs = html_content.replace('\\"', '"')
# Regex pattern
pattern = r'{"order":\d+,"url":"(https:[^"]+\.webp)"}'
# Find matches
matches = re.findall(pattern, no_bs)
# Print matches
for match in matches:
    print(match)
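To see why the replace step matters without hitting the network, here is a minimal sketch on a made-up string. The `raw` sample, the `extract_urls` helper, and the `TOLERANT` pattern are assumptions for illustration only. The tolerant pattern is an alternative that accepts an optional backslash before each quote, so it should work whether or not the text has been unescaped.

```python
import re

# Same pattern as in the post.
STRICT = r'{"order":\d+,"url":"(https:[^"]+\.webp)"}'
# Hypothetical variant: `\\?` allows an optional backslash before each quote.
TOLERANT = r'\{\\?"order\\?":\d+,\\?"url\\?":\\?"(https:[^"\\]+\.webp)\\?"\}'

def extract_urls(text: str, pattern: str = TOLERANT) -> list[str]:
    """Return every captured .webp URL found in `text`."""
    return re.findall(pattern, text)

# Made-up sample imitating escaped JSON embedded in a script tag.
raw = (r'{\"order\":1,\"url\":\"https://example.com/a.webp\"},'
       r'{\"order\":2,\"url\":\"https://example.com/b.webp\"}')
unescaped = raw.replace('\\"', '"')

print(extract_urls(raw, STRICT))        # strict pattern on escaped text
print(extract_urls(unescaped, STRICT))  # strict pattern after the replace
print(extract_urls(raw))                # tolerant pattern on escaped text
```

The strict pattern finds nothing in `raw` because each quote is preceded by a backslash, while both the replace step and the tolerant pattern recover the two URLs.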