Handling sync and async with scrapy + Playwright


I'm using Scrapy and Playwright to load a Google Jobs search results page. Playwright needs to load the page in a browser and then click on the different jobs to reveal each job's details.

An example URL I want to extract information from: https://www.google.com/search?q=product+designer+nyc&ibp=htl;jobs

While I can write code that opens that page in a Playwright browser and parses the fields I want in an interactive Python environment, I'm not sure how to integrate Playwright smoothly into Scrapy. I have set up the start_requests method correctly, in the sense that Playwright is configured and opens the browser to the desired page, such as the URL above.
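For reference, a minimal sketch of what such a start_requests setup can look like with the scrapy-playwright plugin (the meta keys are scrapy-playwright's; the spider class name is hypothetical):

import scrapy

class GoogleJobsSpider(scrapy.Spider):
    name = "google_jobs"  # hypothetical name

    def start_requests(self):
        # "playwright": True routes the request through a real browser;
        # "playwright_include_page": True exposes the Playwright page
        # object to the callback as response.meta["playwright_page"]
        yield scrapy.Request(
            url="https://www.google.com/search?q=product+designer+nyc&ibp=htl;jobs",
            meta={"playwright": True, "playwright_include_page": True},
            callback=self.parse,
        )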

Here's what I have so far for the parse method:

async def parse(self, response):
    page = response.meta["playwright_page"]

    jobs = page.locator("//li")
    num_jobs = jobs.count()

    for idx in range(num_jobs):
        # For each job found, first need to click on it
        await jobs.nth(idx).click()

        # Then grab this large section of the page that has details about the job
        # In that large section, first click a couple of "More" buttons
        job_details = page.locator("#tl_ditsc")
        more_button1 = job_details.get_by_text("More job highlights")
        await more_button1.click()
        more_button2 = job_details.get_by_text("Show full description")
        await more_button2.click()

        # Then take that large section and pass it to another function for parsing
        soup = BeautifulSoup(job_details, 'html.parser')
        data = self.parse_single_jd(soup)

    ...
    yield {data here}
    return

When I try to run the code above, it errors on the for idx in range(num_jobs) line with "TypeError: 'coroutine' object cannot be interpreted as an integer". When run in an interactive Python shell, page.locator, jobs.count(), jobs.nth(#).click(), and so on all work fine. This leads me to believe I'm misunderstanding something fundamental about the async nature of parse, which I understand is needed to be able to perform actions like clicking on the page (per the docs). It's as if I need to force num_jobs = jobs.count() to "evaluate", but it doesn't.

(Note that a little further down, if I try to add an if more_button1.count() check before the await more_button1.click() line, I hit the same error, as if I need to force .count() to "evaluate".)

Any suggestions?

python scrapy playwright
1 Answer

The error you're seeing, "TypeError: 'coroutine' object cannot be interpreted as an integer", occurs because jobs.count() returns a coroutine object, not an integer. In Playwright's async API, it has to be awaited.
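This would also explain why the same calls worked in the interactive shell: a REPL session typically uses Playwright's sync API (an assumption; the question doesn't say which API the shell used), where count() blocks and returns an int directly. A minimal contrast, with example.com standing in for the real page:

from playwright.sync_api import sync_playwright

# Sync API (typical in an interactive shell): count() returns an int
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    num_jobs = page.locator("//li").count()  # plain int, no await needed
    browser.close()

# Async API (what an async Scrapy callback uses): count() returns a
# coroutine, so the call must be awaited:
#     num_jobs = await page.locator("//li").count()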

Let Scrapy's asyncio event loop drive the Playwright coroutines in the callback, and let scrapy-playwright manage the browser rather than launching Playwright by hand in start_requests (a hand-launched browser is torn down as soon as the start_requests generator finishes, before the callback runs):

import scrapy
from scrapy.crawler import CrawlerProcess
from bs4 import BeautifulSoup

class MySpider(scrapy.Spider):
    name = "my_spider"

    # scrapy-playwright needs the asyncio reactor and its download handlers
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        # scrapy-playwright opens the page in a browser and hands the
        # Playwright page object to the callback via response.meta
        yield scrapy.Request(
            url="https://www.google.com/search?q=product+designer+nyc&ibp=htl;jobs",
            meta={"playwright": True, "playwright_include_page": True},
            callback=self.parse,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]

        # locator() is synchronous and returns a Locator immediately;
        # only the action methods (count, click, inner_html, ...) are
        # coroutines and must be awaited
        jobs = page.locator("//li")
        num_jobs = await jobs.count()

        for idx in range(num_jobs):
            # For each job found, first click on it
            await jobs.nth(idx).click()

            # Then grab the large section of the page that has the job
            # details, and expand it by clicking the two "More" buttons
            job_details = page.locator("#tl_ditsc")
            more_button1 = job_details.get_by_text("More job highlights")
            await more_button1.click()
            more_button2 = job_details.get_by_text("Show full description")
            await more_button2.click()

            job_html = await job_details.inner_html()

            # Pass the HTML content to another function for parsing
            soup = BeautifulSoup(job_html, "html.parser")
            data = self.parse_single_jd(soup)

            yield data

        await page.close()

    def parse_single_jd(self, soup):
        # parsing logic here......

        # extract job title
        job_title = soup.find("h1").text.strip()

        return {"job_title": job_title}


process = CrawlerProcess()
process.crawl(MySpider)
process.start()
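The same rule resolves the parenthetical note in the question: a guard around the clicks has to await count() as well. For example, inside the loop above:

# count() is a coroutine in the async API, so the guard awaits it too
if await more_button1.count():
    await more_button1.click()
if await more_button2.count():
    await more_button2.click()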