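I am scraping an e-commerce website with scrapy-playwright. With headless: True I get a 403 error, but with headless: False I get a 200. I even tried randomizing the user agent, and I am still blocked.
The scraper runs fine with the Playwright Firefox and WebKit drivers, but it takes a lot of time, so I would like to run it with Chromium.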
    import json

    from scrapy import Request

    # Method of the spider class
    def make_request_from_data(self, data):
        payload = json.loads(data)
        isbn = payload["isbn"]
        url = f"https://www.barnesandnoble.com/s/{isbn}"
        meta = {
            "region": self.region,
            "isbn": isbn,
            "playwright": True,
            "playwright_include_page": True,
            # One browser context per ISBN
            "playwright_context": f"context-{isbn}",
            "playwright_context_kwargs": {
                "java_script_enabled": True,
            },
        }
        headers = {
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "en",
            "cache-control": "no-cache",
            "pragma": "no-cache",
            "sec-fetch-dest": "document",
            "sec-fetch-mode": "navigate",
            "sec-fetch-site": "none",
            "sec-fetch-user": "?1",
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
        }
        yield Request(
            url=url,
            headers=headers,
            callback=self.parse,
            errback=self.close_context_on_error,
            meta=meta,
            dont_filter=True,
        )
The ISBN is the book code. My guess is that the Chromium version is the cause, but I don't know how to downgrade the Chromium version in Playwright.
I just solved the same problem by setting a custom user agent on the browser context. This link was helpful:
https://jsoverson.medium.com/how-to-bypass-access-denied-pages-with-headless-chrome-87ddd5f3413c
    [OneTimeSetUp]
    public async Task OneTimeSetUp()
    {
        // Set up Playwright
        _playwright = await Playwright.CreateAsync();
        var browserType = _playwright.Chromium;
        _browser = await browserType.LaunchAsync(new BrowserTypeLaunchOptions
        {
            Headless = true
        });
        var browserContext = await _browser.NewContextAsync(new BrowserNewContextOptions
        {
            BaseURL = baseURL,
            UserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"
        });
        _page = await browserContext.NewPageAsync();
        await _page.SetViewportSizeAsync(1366, 768);
    }
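For the scrapy-playwright setup in the question, the same idea might be sketched as follows. scrapy-playwright passes playwright_context_kwargs to Playwright's new_context, which accepts user_agent. The helper name build_playwright_meta and the placeholder ISBN are assumptions made for illustration, and whether this is enough to get past the 403 on this particular site is not verified. One possible reason the header alone did not help is that a user-agent request header does not necessarily change what scripts on the page read from navigator.userAgent, whereas the context option does.

```python
# Sketch only: build_playwright_meta is a hypothetical helper, not part of
# scrapy-playwright. It returns the Request.meta dict from the question with
# a user agent added at the browser-context level.
DESKTOP_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
)


def build_playwright_meta(isbn, region, user_agent=DESKTOP_UA):
    return {
        "region": region,
        "isbn": isbn,
        "playwright": True,
        "playwright_include_page": True,
        "playwright_context": f"context-{isbn}",
        # These kwargs are forwarded to Playwright's browser.new_context()
        "playwright_context_kwargs": {
            "java_script_enabled": True,
            "user_agent": user_agent,
        },
    }


# "9780000000000" is a placeholder ISBN, not a real book.
meta = build_playwright_meta("9780000000000", "us")
print(meta["playwright_context_kwargs"]["user_agent"])
```

Inside the spider, the returned dict would be passed as meta=... to the Request in make_request_from_data.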
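I am working with Playwright and ran into the same problem. After trying a bunch of alternatives that did not work, I found that the problem was in the way I was setting headless mode. The following uses Java, but the same solution may also apply to the other SDKs.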
    var playwright = Playwright.create();
    var launchOptions = new BrowserType.LaunchOptions();
    // Headless mode is requested through a Chromium argument
    launchOptions.setArgs(List.of("--headless", "--no-sandbox"));
    var browser = playwright.chromium().launch(launchOptions);
    var contextOptions = new Browser.NewContextOptions();
    contextOptions.setUserAgent(USER_AGENT);
    contextOptions.setExtraHTTPHeaders(Map.of(
        "Accept-Language", "en-US,en;q=0.9",
        "Accept-Encoding", "gzip, deflate"
    ));
    contextOptions.setIgnoreHTTPSErrors(true);
    // getBrowserContext is a helper that is not shown here; presumably it
    // calls browser.newContext(contextOptions)
    var browserContext = getBrowserContext(browser, params);
    // ... rest of code
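Note that I enabled headless mode through the args in launchOptions; for some reason, setting it through setHeadless did not work. Also note that I set the user agent and some headers to get past the anti-bot system.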
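In Python, the launch and context options from the Java snippet might look roughly like this. It is a sketch under assumptions: USER_AGENT is a placeholder value, launch_kwargs, context_kwargs, and fetch_status are hypothetical helper names, and it has not been confirmed that this gets past the 403 on any given site.

```python
# Sketch only: a Python counterpart of the Java options above.
# USER_AGENT is a placeholder; substitute the string you want to present.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
)


def launch_kwargs():
    # Same Chromium arguments as the Java snippet
    return {"args": ["--headless", "--no-sandbox"]}


def context_kwargs(user_agent=USER_AGENT):
    # Keyword arguments for Playwright's browser.new_context()
    return {
        "user_agent": user_agent,
        "extra_http_headers": {
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate",
        },
        "ignore_https_errors": True,
    }


def fetch_status(url):
    # Imported lazily so the option builders work without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(**launch_kwargs())
        context = browser.new_context(**context_kwargs())
        page = context.new_page()
        response = page.goto(url)
        status = response.status if response else None
        browser.close()
        return status
```

With scrapy-playwright, the first dict would correspond to the PLAYWRIGHT_LAUNCH_OPTIONS setting and the second to the playwright_context_kwargs meta key.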