I'm using Scrapy and Playwright to load Google Jobs search result pages. Playwright is needed to load the page in a browser and then click on the individual jobs to reveal each job's details.
Example URL I want to extract information from: https://www.google.com/search?q=product+designer+nyc&ibp=htl;jobs
While I can write code that opens that page in a Playwright browser and parses the fields I want in an interactive Python session, I'm not sure how to integrate Playwright smoothly into Scrapy. I do have the start_requests function set up correctly, in the sense that Playwright is configured and opens the browser to the desired page, such as the URL above.
Here's what I have so far for the parse function:
async def parse(self, response):
    page = response.meta["playwright_page"]
    jobs = page.locator("//li")
    num_jobs = jobs.count()
    for idx in range(num_jobs):
        # For each job found, first need to click on it
        await jobs.nth(idx).click()
        # Then grab this large section of the page that has details about the job
        # In that large section, first click a couple of "More" buttons
        job_details = page.locator("#tl_ditsc")
        more_button1 = job_details.get_by_text("More job highlights")
        await more_button1.click()
        more_button2 = job_details.get_by_text("Show full description")
        await more_button2.click()
        # Then take that large section and pass it to another function for parsing
        soup = BeautifulSoup(job_details, 'html.parser')
        data = self.parse_single_jd(soup)
        ...
        yield {data here}
    return
When I try to run the code above, it errors out on the for idx in range(num_jobs) line with "TypeError: 'coroutine' object cannot be interpreted as an integer". When run in an interactive Python shell, page.locator, jobs.count(), jobs.nth(#).click(), etc. all work fine. This leads me to believe I'm misunderstanding something fundamental about the async nature of parse, which I gather is needed in order to perform actions like clicking on the page (per the docs). It's as if I need to force num_jobs = jobs.count() to "evaluate", but it doesn't.
(Note that a bit further down I hit the same error if I try to add an if more_button1.count() check before the await more_button1.click() line, as if I need to force .count() to "evaluate" there too.)
Any suggestions?
The error you're seeing, "TypeError: 'coroutine' object cannot be interpreted as an integer", occurs because jobs.count() returns a coroutine object, not an integer. In Playwright's async API, methods like count() must be awaited before you get their value.
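For example, creating a locator is synchronous, but any call that actually queries the browser is a coroutine:

    jobs = page.locator("//li")    # locator() is synchronous, no await needed
    num_jobs = await jobs.count()  # count() returns a coroutine; await it to get the int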
Use Scrapy's asyncio event loop to manage the Playwright operations inside your Scrapy callbacks. Below is a sketch that assumes the scrapy-playwright package, which the playwright_page meta key in your code suggests you're already using:
import scrapy
from scrapy.crawler import CrawlerProcess
from bs4 import BeautifulSoup
class MySpider(scrapy.Spider):
    name = "my_spider"

    def start_requests(self):
        # Let scrapy-playwright drive the browser; playwright_include_page
        # makes the Playwright page object available in the callback
        yield scrapy.Request(
            url="https://www.google.com/search?q=product+designer+nyc&ibp=htl;jobs",
            meta={"playwright": True, "playwright_include_page": True},
            callback=self.parse,
        )
    async def parse(self, response):
        page = response.meta["playwright_page"]
        # locator() is synchronous; only the methods that talk to the
        # browser (count(), click(), inner_html(), ...) are coroutines
        jobs = page.locator("//li")
        num_jobs = await jobs.count()
        for idx in range(num_jobs):
            # For each job found, first need to click on it
            await jobs.nth(idx).click()
            # Then grab the large section of the page that has details about the job
            job_details = page.locator("#tl_ditsc")
            # Click the "More" buttons if they are present; count() must be awaited here too
            more_button1 = job_details.get_by_text("More job highlights")
            if await more_button1.count():
                await more_button1.click()
            more_button2 = job_details.get_by_text("Show full description")
            if await more_button2.count():
                await more_button2.click()
            # inner_html() is also a coroutine; await it to get a string
            job_html = await job_details.inner_html()
            # Pass the HTML content to another function for parsing
            soup = BeautifulSoup(job_html, 'html.parser')
            data = self.parse_single_jd(soup)
            yield data
        await page.close()
    def parse_single_jd(self, soup):
        # parsing logic here......
        # extract job title
        job_title = soup.find("h1").text.strip()
        return {"job_title": job_title}
process = CrawlerProcess()
process.crawl(MySpider)
process.start()
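Note that scrapy-playwright only handles requests once its download handlers and the asyncio Twisted reactor are enabled. A minimal settings sketch, assuming a default scrapy-playwright install (put this in settings.py, or pass it as a dict to CrawlerProcess(settings=...)):

# Route HTTP(S) requests through scrapy-playwright and use the asyncio reactor
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"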