Using Scrapy, how can I send a POST request to a web form (no login required) and retrieve the results?

Votes: 0 · Answers: 1

I am trying to send a POST request to this form with Scrapy (http://www.umdata.org/SearchChurches.aspx). Am I sending the payload (form data) correctly? Specifically:

  1. I want to fill in the form fields with my payload and submit the form.
  2. Retrieve the results from the results table with my parse_after_search function.
import scrapy
from scrapy.http import FormRequest

class QuotesSpider(scrapy.Spider):
    name = 'greatriver'
    login_url = 'http://www.umdata.org/SearchChurches.aspx'
    start_urls = [login_url]


    def parse(self, response):
        login_url = 'http://www.umdata.org/SearchChurches.aspx'
        return FormRequest.from_response(
                response,
                url=login_url,
                clickdata={'name': 'ctl00$ContentPlaceHolder1$btnSearch'},
                formdata={'ctl00$ContentPlaceHolder1$ddlState': 'AK'},
                callback=self.parse_after_search,
        )

    def parse_after_search(self, response):
        print(response.xpath('//*[@id="ContentPlaceHolder1_gvResults"]/tbody/tr[3]/td[1]/a'))

The output when I run this script:


2023-04-20 20:00:04 [scrapy.core.engine] INFO: Spider opened
2023-04-20 20:00:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-04-20 20:00:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2023-04-20 20:00:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.umdata.org/SearchChurches.aspx> (referer: None)
2023-04-20 20:00:06 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.umdata.org/SearchChurches.aspx> (referer: http://www.umdata.org/SearchChurches.aspx)
[]
2023-04-20 20:00:06 [scrapy.core.engine] INFO: Closing spider (finished)
2023-04-20 20:00:06 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

I was expecting my response.xpath call to return a string, but it prints an empty list.

web-scraping scrapy web-crawler
1 Answer

Votes: 0

Your request is fine; you get an empty list because the selector is wrong. Browsers insert a tbody element into rendered tables, but the raw HTML that Scrapy parses often does not contain one, so the /tbody/ step in your XPath matches nothing.
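For what it's worth, the POST itself succeeds because FormRequest.from_response copies the form's pre-filled fields, including ASP.NET's hidden inputs such as __VIEWSTATE and __EVENTVALIDATION, and only overrides the keys you pass in formdata. You can confirm the missing tbody yourself; a quick check along these lines (a sketch, assuming response already holds the page carrying the results grid, e.g. after replaying the FormRequest in scrapy shell):

# The browser's DOM shows a tbody, but the raw HTML Scrapy parses does not:
response.xpath('//*[@id="ContentPlaceHolder1_gvResults"]/tbody/tr').get()
# -> None here, matching the empty [] in the question's log
response.xpath('//table[@id="ContentPlaceHolder1_gvResults"]//tr').get()
# -> matches: the // step works whether or not a tbody is present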

import scrapy
from scrapy.http import FormRequest


class QuotesSpider(scrapy.Spider):
    name = 'greatriver'
    login_url = 'http://www.umdata.org/SearchChurches.aspx'
    start_urls = [login_url]

    def parse(self, response):
        login_url = 'http://www.umdata.org/SearchChurches.aspx'
        return FormRequest.from_response(
            response,
            url=login_url,
            clickdata={'name': 'ctl00$ContentPlaceHolder1$btnSearch'},
            formdata={'ctl00$ContentPlaceHolder1$ddlState': 'AK'},
            callback=self.parse_after_search,
        )

    def parse_after_search(self, response):
        # No tbody in the XPath: Scrapy sees the raw HTML, where the
        # browser-inserted tbody usually does not exist.
        # tr[not(@style)] keeps only rows without an inline style
        # attribute, which skips the grid's decorated header/pager rows.
        for line in response.xpath('//table[@id="ContentPlaceHolder1_gvResults"]//tr[not(@style)]'):
            yield {
                # The church name sits inside a link, hence //text()
                'Church Name': line.xpath('./td[1]//text()').get(),
                'Conference': line.xpath('./td[2]/text()').get(),
                'District': line.xpath('./td[3]/text()').get(),
                'Location': line.xpath('./td[4]/text()').get(),
            }
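A side note on the original expectation: response.xpath() returns a SelectorList, not a string; the question's print() showed [] because nothing matched, and even on a match it would print Selector objects rather than text. Chaining .get() or .getall() is what produces plain strings, e.g. (a sketch against the same table):

# .get() -> the first matching text node as a str, or None if nothing matches
response.xpath('//table[@id="ContentPlaceHolder1_gvResults"]//tr[not(@style)]/td[1]//text()').get()
# .getall() -> every matching text node, as a plain list of str
response.xpath('//table[@id="ContentPlaceHolder1_gvResults"]//tr[not(@style)]/td[1]//text()').getall()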

The spider's output:

...
...
{'Church Name': 'FIRST UNITED METHODIST CHURCH OF ANCHORAGE', 'Conference': 'ALASKA', 'District': 'ALASKA', 'Location': 'ANCHORAGE, AK'}

{'Church Name': 'FIRST UNITED METHODIST CHURCH OF FAIRBANKS', 'Conference': 'ALASKA', 'District': 'ALASKA', 'Location': 'FAIRBANKS, AK'}
...
...
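To keep the items instead of just logging them, run the spider with a feed export, e.g. scrapy runspider greatriver.py -o churches.csv, or drive it from a script. A minimal sketch using Scrapy's CrawlerProcess (the FEEDS setting requires Scrapy 2.1+; assumes this runs in the same file as the spider class above):

from scrapy.crawler import CrawlerProcess

# Export every yielded item to a CSV file via Scrapy's feed exports
process = CrawlerProcess(settings={'FEEDS': {'churches.csv': {'format': 'csv'}}})
process.crawl(QuotesSpider)  # the spider class defined above
process.start()              # blocks until the crawl finishes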