Scrapy Splash with XPath returns no results


The page I want to scrape is https://www.biggerpockets.com/forums/88/topics/895460-cap-rate-vs-interest-rate

In the browser's developer console, the XPath returns the text element that holds the post title.

[Screenshot: Developer Console xpath]

However, when I run Scrapy, the same XPath does not work and the title comes back as None.

yield SplashRequest("https://www.biggerpockets.com/forums/88/topics/895460-cap-rate-vs-interest-rate", self.parse_post, args={'wait': 2})

def parse_post(self, response):
    title = response.xpath('//div[contains(@class, "simplified-forums__discussion")]//div[contains(@class, "simplified-forums__discussion__first-post")]//div[contains(@class, "simplified-forums__card__content")]//h1/text()').get()
    print(title)

Output:

2023-11-01 00:16:11 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.biggerpockets.com/forums/49/topics/276013-interest-rate>
None

When I visit

http://localhost:8050/render.html?url=https://www.biggerpockets.com/forums/88/topics/895460-cap-rate-vs-interest-rate

the page also renders fine, so I'm not sure what is going wrong, because I'm confident the XPath is correct.

If I'm missing something, please help.
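Seeing the page render in a browser does not confirm what the spider receives, so one option is to fetch the same render.html output with the same wait value and look for the title markup in the raw HTML. The following is a minimal sketch under assumptions: the helper names are hypothetical, the class name is the one used in the answer below, and Splash is assumed to be listening on localhost:8050 as in the question.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH_RENDER = "http://localhost:8050/render.html"  # endpoint from the question
# Class name taken from the answer; assumed to appear in the rendered markup.
TITLE_CLASS = "simplified-forums__topic-content__title"


def build_render_url(page_url, wait=2):
    """Build a render.html URL that passes the same wait value as the spider."""
    return SPLASH_RENDER + "?" + urlencode({"url": page_url, "wait": wait})


def fetch_rendered_html(page_url, wait=2):
    """Download the HTML Splash produces (needs a running Splash instance)."""
    with urlopen(build_render_url(page_url, wait), timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")


def title_class_present(html):
    """Crude substring check: did the title markup arrive at all?"""
    return TITLE_CLASS in html
```

Calling title_class_present(fetch_rendered_html(url)) with the forum URL gives a quick yes/no. If it is True but the spider still prints None, the XPath is the more likely culprit rather than the rendering.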

python web-scraping scrapy scrapy-splash
1 Answer

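As I mentioned in the comments, your XPath appears to be wrong. Try selecting the h1 directly by its own class instead of through the chain of ancestor divs: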

import scrapy

class BiggerPocketsSpider(scrapy.Spider):
    name = 'biggerpockets'
    start_urls = ['https://www.biggerpockets.com/forums/88/topics/895460-cap-rate-vs-interest-rate']

    def parse(self, response):
        # Select the topic title directly by the class on its <h1> element.
        title = response.xpath("//h1[@class='simplified-forums__topic-content__title']/text()").get()
        print("-------Extracted text-----------------")
        print(title)
        print("------------------------")
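To illustrate why a long ancestor chain is more fragile than a direct match, here is a small offline sketch. It is not the real page: the HTML snippet is a hypothetical stand-in, and it uses Python's standard-library ElementTree, which supports only a small XPath subset (no contains()), so the classes are matched exactly. If any one ancestor in the chain is absent from the markup, the whole expression matches nothing.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the rendered page. The real markup
# is not shown in the post and may differ.
SAMPLE = """
<html><body>
  <div class="simplified-forums__discussion">
    <h1 class="simplified-forums__topic-content__title">Example title</h1>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)


def first_text(path):
    """Return the text of the first match, or None (like Scrapy's .get())."""
    node = root.find(path)
    return None if node is None else node.text


# Chain through an ancestor div that this sample does not contain.
long_chain = first_text(
    ".//div[@class='simplified-forums__discussion']"
    "//div[@class='simplified-forums__card__content']//h1"
)
# Direct match on the class of the <h1> itself.
direct = first_text(".//h1[@class='simplified-forums__topic-content__title']")

print(long_chain)
print(direct)
```

In this sample the chained expression yields None while the direct one yields the title text, which mirrors the symptom in the question. The same comparison can be run against the actual response in scrapy shell.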
