I'm using Scrapy and Playwright to load Google Jobs search result pages. Playwright is needed to load the page in a browser and then click on the individual jobs to reveal each job's details.
Example URL I want to extract information from: https://www.google.com/search?q=product+designer+nyc&ibp=htl;jobs
While I can write code that opens that page in a Playwright browser and parses the fields I want in an interactive Python session, I'm not sure how to integrate Playwright smoothly into Scrapy. I do have the start_requests function set up correctly, in the sense that Playwright is configured and opens the browser to the desired page, such as the URL above.
Here's what I have so far for the parse function:
async def parse(self, response):
    page = response.meta["playwright_page"]
    jobs = page.locator("//li")
    num_jobs = jobs.count()
    for idx in range(num_jobs):
        # For each job found, first need to click on it
        await jobs.nth(idx).click()
        # Then grab this large section of the page that has details about the job
        # In that large section, first click a couple of "More" buttons
        job_details = page.locator("#tl_ditsc")
        more_button1 = job_details.get_by_text("More job highlights")
        await more_button1.click()
        more_button2 = job_details.get_by_text("Show full description")
        await more_button2.click()
        # Then take that large section and pass it to another function for parsing
        soup = BeautifulSoup(job_details, 'html.parser')
        data = self.parse_single_jd(soup)
        ...
        yield {data here}
    return
When I try to run the code above, it errors out on the for idx in range(num_jobs) line with "TypeError: 'coroutine' object cannot be interpreted as an integer". When run in an interactive Python shell, page.locator, jobs.count(), jobs.nth(#).click(), etc. all work fine. This leads me to believe I'm misunderstanding something fundamental about the async nature of parse, which I gather is needed in order to perform actions like clicking on the page (per the docs). It's as if I need to force num_jobs = jobs.count() to "evaluate", but it doesn't.
(Note that a bit further down I hit the same error if I try to add an if more_button1.count() check before the await more_button1.click() line, as if I need to force .count() to "evaluate" there too.)
Any suggestions?
The error you're seeing, "TypeError: 'coroutine' object cannot be interpreted as an integer", occurs because jobs.count() returns a coroutine object, not an integer. In Playwright's async API, methods like count() must be awaited before you get their value.
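For example, creating a locator is synchronous, but any call that actually queries the browser is a coroutine:

    jobs = page.locator("//li")    # locator() is synchronous, no await needed
    num_jobs = await jobs.count()  # count() returns a coroutine; await it to get the int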
Use Scrapy's asyncio event loop to manage the Playwright operations inside your Scrapy callbacks. Below is a sketch that assumes the scrapy-playwright package, which the playwright_page meta key in your code suggests you're already using:
import scrapy
from scrapy.crawler import CrawlerProcess
from bs4 import BeautifulSoup
class MySpider(scrapy.Spider):
    name = "my_spider"

    def start_requests(self):
        # Let scrapy-playwright drive the browser; playwright_include_page
        # makes the Playwright page object available in the callback
        yield scrapy.Request(
            url="https://www.google.com/search?q=product+designer+nyc&ibp=htl;jobs",
            meta={"playwright": True, "playwright_include_page": True},
            callback=self.parse,
        )
    async def parse(self, response):
        page = response.meta["playwright_page"]
        # locator() is synchronous; only the methods that talk to the
        # browser (count(), click(), inner_html(), ...) are coroutines
        jobs = page.locator("//li")
        num_jobs = await jobs.count()
        for idx in range(num_jobs):
            # For each job found, first need to click on it
            await jobs.nth(idx).click()
            # Then grab the large section of the page that has details about the job
            job_details = page.locator("#tl_ditsc")
            # Click the "More" buttons if they are present; count() must be awaited here too
            more_button1 = job_details.get_by_text("More job highlights")
            if await more_button1.count():
                await more_button1.click()
            more_button2 = job_details.get_by_text("Show full description")
            if await more_button2.count():
                await more_button2.click()
            # inner_html() is also a coroutine; await it to get a string
            job_html = await job_details.inner_html()
            # Pass the HTML content to another function for parsing
            soup = BeautifulSoup(job_html, 'html.parser')
            data = self.parse_single_jd(soup)
            yield data
        await page.close()
    def parse_single_jd(self, soup):
        # parsing logic here......
        # extract job title
        job_title = soup.find("h1").text.strip()
        return {"job_title": job_title}
process = CrawlerProcess()
process.crawl(MySpider)
process.start()
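Note that scrapy-playwright only handles requests once its download handlers and the asyncio Twisted reactor are enabled. A minimal settings sketch, assuming a default scrapy-playwright install (put this in settings.py, or pass it as a dict to CrawlerProcess(settings=...)):

# Route HTTP(S) requests through scrapy-playwright and use the asyncio reactor
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"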