Selecting results one by one in Scrapy

Question  Votes: 1  Answers: 2

I downloaded the source of one page from Indeed and I am trying to get all the job titles from it. For that I am using this XPath:

response.xpath('//*[@class="  row  result"]//*[@class="jobtitle"]//text()').extract()

The problem is that each title does not come back as a single string. Instead I get this:

[u'\n    ',
 u'Data',
 u' ',
 u'Scientist',
 u' Experto SQL con conocimiento en R',
 u'\n    ',
 u'\n    ',
 u'Data',
 u' Analytic con Python',
 u'\n    ',
 u'\n    ',
 u'Data',
 u' Analytic con R',

This makes it hard to map the titles to the other data. What I want is to select and process the jobs one by one, something like extract_first():

response.xpath('//*[@class="  row  result"]').extract_first()

but for any given index, and with the option of continuing to process the selected data. I tried this:

current_job = response.xpath('//*[@class="  row  result"]').extract_first()
current_job = TextResponse(url='',body=current_job,encoding='utf-8') 

But it only works for the first result, and it does not look like a Pythonic approach to me.

python web-scraping scrapy
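2 Answers
2
votes

First I select only the a elements (without text() and extract()). Then I use a for loop to apply text() and extract() to each a separately, and join() to concatenate the pieces into a single string containing the title.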

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = ['https://www.indeed.cl/trabajo?q=Data%20scientist&l=']

    def parse(self, response):
        print('url:', response.url)

        results = response.xpath('//h2[@class="jobtitle"]/a')
        print('number:', len(results))

        for item in results:
            title = ''.join(item.xpath('.//text()').extract())
            print('title:', title)

# --- runs as a standalone script, without a Scrapy project; titles are only printed ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(MySpider)
c.start()

Result:

number: 10
title: Data Scientist
title: CONSULTOR DATA SCIENCE SANTIAGO DE CHILE
title: Líder Análisis de Datos MCoE Minerals Americas
title: Ingeniero Inteligencia Mercado, BI
title: Ingeniero Inteligencia de Mercado, Business Intelligence
title: Data Scientist
title: Data Scientist
title: Data Scientist (Machine Learning)
title: Data Scientist / Ml Scientist
title: Young Professional - Spanish LatAm

1
vote

Give this a try. You will need to adjust my script slightly to fit your project. It should solve the problem you described above.

import requests
from scrapy import Selector

res = requests.get("https://www.indeed.cl/trabajo?q=Data%20scientist")
# Pass the HTML as text: `res` is a requests response, not a Scrapy response
sel = Selector(text=res.text)
for item in sel.css("h2.jobtitle a"):
    title = ' '.join(item.css("::text").extract())
    print(title)

Output:

Data   Scientist
CONSULTOR  DATA  SCIENCE SANTIAGO DE CHILE
Líder Análisis de Datos MCoE Minerals Americas
Ingeniero Inteligencia Mercado, BI
Ingeniero Inteligencia de Mercado, Business Intelligence
Data   Scientist
Data   Scientist
Young Professional - Spanish LatAm
Data   Scientist  (Machine Learning)
Data   Scientist  / Ml  Scientist
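
The repeated spaces in this output most likely come from joining with ' ' fragments that already carry their own spaces. A small sketch of one way to tidy that up, using only the standard library: join the fragments with an empty string, then collapse every run of whitespace. The clean_title name is hypothetical, and the fragments are copied from the output shown in the question.

```python
def clean_title(fragments):
    """Join text fragments and collapse each run of whitespace to one space."""
    return ' '.join(''.join(fragments).split())


# Fragments copied from the output shown in the question.
fragments = [
    u'\n    ',
    u'Data',
    u' ',
    u'Scientist',
    u' Experto SQL con conocimiento en R',
    u'\n    ',
]

title = clean_title(fragments)
print(title)
```

In the loop above, this would replace the ' '.join(...) call, for example title = clean_title(item.css("::text").extract()).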