Scrapy Playwright (Python) returns 403 with headless = True but 200 with headless = False?


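I am scraping an e-commerce website with scrapy-playwright. When I scrape with headless set to True I get a 403 error, but with headless set to False I get a 200. I even tried randomizing the user agent and I am still blocked.

The scraper works with the Firefox and WebKit Playwright drivers, but it takes a lot of time, so I want to run it with Chromium.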

def make_request_from_data(self, data):
    payload = json.loads(data)
    isbn = payload["isbn"]

    url = f"https://www.barnesandnoble.com/s/{isbn}"
    meta = {
        "region": self.region,
        "isbn": isbn,
        "playwright": True,
        "playwright_include_page": True,
        "playwright_context": f"context-{isbn}",
        "playwright_context_kwargs": {
            "java_script_enabled": True,
        },
    }
    headers = {
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en",
        "cache-control": "no-cache",
        "pragma": "no-cache",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "none",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    }
    yield Request(
        url=url,
        headers=headers,
        callback=self.parse,
        errback=self.close_context_on_error,
        meta=meta,
        dont_filter=True,
    )

The ISBN is the book code. My guess is that the Chromium version is the cause, but I don't know how to downgrade the Chromium version in Playwright.

python scrapy playwright-python
2 Answers
0 votes

I just solved the same problem by adding a custom UserAgent to the browser context. This link was helpful:

https://jsoverson.medium.com/how-to-bypass-access-denied-pages-with-headless-chrome-87ddd5f3413c

How to add a custom header in Playwright (C#, NUnit):

[OneTimeSetUp]
public async Task OneTimeSetUp()
{
    // Set up Playwright
    _playwright = await Playwright.CreateAsync();
    var browserType = _playwright.Chromium;
    _browser = await browserType.LaunchAsync(new BrowserTypeLaunchOptions
    {
        Headless = true
    });
    var browserContext = await _browser.NewContextAsync(new BrowserNewContextOptions
    {
        BaseURL = baseURL,
        UserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"
    });
    _page = await browserContext.NewPageAsync();
    await _page.SetViewportSizeAsync(1366, 768);
}
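
For the scrapy-playwright setup in the question, the same idea can be sketched in Python. This is a minimal sketch rather than a confirmed fix for this site: `build_playwright_meta`, `CUSTOM_USER_AGENT`, and the placeholder ISBN are hypothetical names and values, and it assumes scrapy-playwright forwards `playwright_context_kwargs` to Playwright's `new_context`, where `user_agent` applies to the whole browser context instead of only to the request headers.

```python
# Hypothetical value: the user agent string is copied from the question's headers.
CUSTOM_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
)


def build_playwright_meta(isbn, region, user_agent=CUSTOM_USER_AGENT):
    """Build Request.meta so the Playwright context itself carries the user agent."""
    return {
        "region": region,
        "isbn": isbn,
        "playwright": True,
        "playwright_include_page": True,
        "playwright_context": f"context-{isbn}",
        # Used when the named context is created (assumed to be passed to new_context).
        "playwright_context_kwargs": {
            "java_script_enabled": True,
            "user_agent": user_agent,
        },
    }


# Placeholder ISBN and region, for illustration only.
meta = build_playwright_meta("9780000000000", "us")
print(meta["playwright_context_kwargs"]["user_agent"])
```

In the spider from the question, this would replace the hand-written `meta` dict, for example `meta=build_playwright_meta(isbn, self.region)` inside `make_request_from_data`.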

0 votes

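I was working with Playwright and ran into the same problem. After trying a bunch of alternatives that did not work, I found that the problem was in the way I was setting headless mode. The following uses Java, but the same solution may also apply to the other SDKs.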

var playwright = Playwright.create();

var launchOptions = new BrowserType.LaunchOptions();
launchOptions.setArgs(List.of("--headless", "--no-sandbox"));
var browser = playwright.chromium().launch(launchOptions);

var contextOptions = new Browser.NewContextOptions();
contextOptions.setUserAgent(USER_AGENT);
contextOptions.setExtraHTTPHeaders(Map.of(
    "Accept-Language", "en-US,en;q=0.9",
    "Accept-Encoding", "gzip, deflate"
));
contextOptions.setIgnoreHTTPSErrors(true);
var browserContext = browser.newContext(contextOptions);

// ... rest of code

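Note that I enabled headless mode through the args in launchOptions; for some reason, setting it via setHeadless did not work. Also note that I set the user agent and some headers to get past the anti-bot system.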
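
A rough Python counterpart for scrapy-playwright might look like the following. It is a sketch under assumptions, not a verified fix: `build_playwright_settings` and the placeholder user agent are hypothetical, and it assumes the scrapy-playwright settings `PLAYWRIGHT_LAUNCH_OPTIONS` and `PLAYWRIGHT_CONTEXTS` are passed to Playwright's `launch` and `new_context` respectively. Whether requesting headless mode through `args` changes how a given site responds is untested here.

```python
def build_playwright_settings(user_agent):
    """Return Scrapy settings mirroring the Java launch and context options above."""
    return {
        "PLAYWRIGHT_BROWSER_TYPE": "chromium",
        # Headless mode is requested through args, as in the Java answer,
        # rather than through the `headless` launch option.
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "args": ["--headless", "--no-sandbox"],
        },
        # Keyword arguments for the context named "default".
        "PLAYWRIGHT_CONTEXTS": {
            "default": {
                "user_agent": user_agent,
                "extra_http_headers": {
                    "Accept-Language": "en-US,en;q=0.9",
                    "Accept-Encoding": "gzip, deflate",
                },
                "ignore_https_errors": True,
            },
        },
    }


# Placeholder user agent, for illustration only.
settings = build_playwright_settings("Mozilla/5.0 (placeholder user agent)")
print(settings["PLAYWRIGHT_LAUNCH_OPTIONS"]["args"])
```

The returned dict could be assigned to a spider's `custom_settings`. Because the spider in the question creates one context per ISBN through `playwright_context`, the context options would need to go into `playwright_context_kwargs` on each request instead of the "default" entry.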
