I'm currently running a Scrapy spider with a SeleniumBase middleware, and for some reason it is crawling a chrome-extension URL. I'm scraping the https://www.atptour.com website, and my crawler never requests anything other than pages on that site. I've attached a log of what happens below:
2024-10-21 17:43:47: [INFO] Spider opened
2024-10-21 17:43:47: [INFO] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-10-21 17:43:47: [INFO] Telnet console listening on 127.0.0.1:6027
2024-10-21 17:43:50: [DEBUG] Started executable: `/Users/philipjoss/miniconda3/envs/capra_production/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 22177 using 0 to output -1
2024-10-21 17:43:51: [DEBUG] Crawled (200) <GET https://www.atptour.com/en/-/tournaments/calendar/tour> (referer: None)
2024-10-21 17:43:54: [DEBUG] Started executable: `/Users/philipjoss/miniconda3/envs/capra_production/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 22180 using 0 to output -1
2024-10-21 17:43:55: [DEBUG] Crawled (200) <GET https://www.atptour.com/en/-/tournaments/calendar/challenger> (referer: None)
2024-10-21 17:43:55: [DEBUG] Crawled (200) <GET chrome-extension://neajdppkdcdipfabeoofebfddakdcjhd/audio.html> (referer: chrome-extension://neajdppkdcdipfabeoofebfddakdcjhd/audio.html)
There are two successful responses for the pages I requested, and then out of nowhere a chrome-extension URL appears. Equally strange, the referer is listed as that same address, which was never requested before.
To make things more interesting, I ran the code on another machine with the same package versions and it works fine there: scrapy 2.11.2 and seleniumbase 4.28.5.
Here is the spider:
from scrapy import Request, Spider
from scrapy.http.response.html import HtmlResponse


class Production(Spider):
    name = "atp_production"
    start_urls = [
        "https://www.atptour.com/en/-/tournaments/calendar/tour",
        "https://www.atptour.com/en/-/tournaments/calendar/challenger",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(
                url=url,
                callback=self._parse_calendar,
                meta=dict(dont_redirect=True),
            )

    def _parse_calendar(self, response: HtmlResponse):
        json_str = response.xpath("//body//text()").get()
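For context, the `/en/-/tournaments/calendar/*` endpoints return a JSON payload in the page body, so the callback presumably decodes the text it extracts. A minimal sketch of that step (the `"tournaments"` key is a placeholder, not the real schema):

```python
import json


def parse_calendar_body(body_text: str) -> dict:
    """Decode the JSON string pulled out of the rendered page body.

    Raises json.JSONDecodeError if the driver returned something other
    than the API payload (e.g. a block/challenge page), which is a handy
    early signal that the browser was sent somewhere unexpected.
    """
    return json.loads(body_text)


# Example with a stand-in payload (the real field names are not shown here):
payload = parse_calendar_body('{"tournaments": []}')
print(payload)  # {'tournaments': []}
```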
Here is the middleware:
from typing import Any

import seleniumbase as sb
from scrapy import Request, Spider, signals
from scrapy.crawler import Crawler
from scrapy.http.response.html import HtmlResponse


class SeleniumBase:
    @classmethod
    def from_crawler(cls, crawler: Crawler):
        middleware = cls(crawler.settings)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def __init__(self, settings: dict[str, Any]) -> None:
        self.driver = sb.Driver(
            uc=settings.get("UNDETECTABLE", None),
            headless=settings.get("HEADLESS", None),
            user_data_dir=settings.get("USER_DATA_DIR", None),
        )

    def spider_closed(self, *_) -> None:
        self.driver.quit()

    def process_request(self, request: Request, spider: Spider) -> HtmlResponse:
        self.driver.get(request.url)
        return HtmlResponse(
            self.driver.current_url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )
Any ideas what might be going on?
Update:
Scrapy now seems to have gone completely off the rails. About 95% of the time it does not deliver responses to the correct callbacks of the downstream parse methods (not shown in the MRE above). I can't really add that logic to the MRE because it is quite complex, and the question would be flagged for containing too much code. Suffice it to say I have triple-checked everything, and besides, it runs fine on my other machine, so the references are definitely all correct.
I have reinstalled scrapy and seleniumbase, but that did not fix the problem :(
I'm having a similar problem. If I find a solution, I'll post it.