我正在尝试使用 Python 3.10 进行网页抓取,并且库请求 - HTML 0.10.0。
附上代码:
from requests_html import HTMLSession
url = 'https://bodysolid-europe.com/collections/all'
s = HTMLSession()
r = s.get(url)
r.html.render(sleep=1)
products = r.html.xpath('/html/body/div[2]/div[2]/div', first=True)
for item in products.absolute_links:
r = s.get(item)
print(r.html.find('header.product-header', first=True).text)
当我尝试通过 Xpath 从 URL 中提取信息时,控制台中显示以下输出:
[D:urllib3.connectionpool] Starting new HTTPS connection (%d): %s:%s
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
[D:asyncio] Using proactor: %s
[D:websockets.client] = connection is CONNECTING
[D:websockets.client] > GET %s HTTP/1.1
[D:websockets.client] > %s: %s
[D:websockets.client] > %s: %s
[D:websockets.client] > %s: %s
[D:websockets.client] > %s: %s
[D:websockets.client] > %s: %s
[D:websockets.client] > %s: %s
[D:websockets.client] > %s: %s
[D:websockets.client] < HTTP/1.1 %d %s
它不会显示项目的所有信息,仅显示一点点,如下所示:
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
Body-Solid Europe
Best Fitness Dumbbell Rack BFDR10
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
Best Fitness
Best Fitness Bench BFFID10
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
Best Fitness
Best Fitness Mountain Climber BFMC10
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
Body-Solid Europe
Best Fitness Multi-Station Gym BFMG30
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
Best Fitness
Best Fitness Center Drive Elliptical BFE1
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
Best Fitness
Best Fitness Olympic Bench BFOB10
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
Best Fitness
Best Fitness Functional Trainer BFFT10
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
Best Fitness
Best Fitness Leg Developer and Preacher Curl Attachment BFPL10
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
Best Fitness
Best Fitness Inversion Table BFINVER10
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
[D:urllib3.connectionpool] %s://%s:%s "%s %s %s" %s %s
Body-Solid Europe
大部分输出只有:
D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
[D:websockets.client] < %s
我不知道问题是什么。我已经安装了 pyppeteer==1.0.0,因为之前我有这个:
<?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>. The specified key does not exist.</Message><Details> No such object: chromium-browser-snapshots/Win_x64/1181205/chrome-win.zip</Details></Error>
但现在它显示“[D:websockets.client] < %s [D:websockets.client] < %s"
我需要修复输出中的错误,以便通过网络抓取从 URL 获取信息。
我也遇到了同样的问题,然后我将 pyppeteer 的版本更改为 1.0.2,它不再显示那些烦人的事情。 我还添加了这两行:
PYPPETEER_CHROMIUM_REVISION = "1263111"
os.environ["PYPPETEER_CHROMIUM_REVISION"] = PYPPETEER_CHROMIUM_REVISION