I want to scrape the location coordinates from this page; they are on the element with id="hotel_address".
import scrapy
from scrapy.crawler import CrawlerProcess


class CrawlerSpider(scrapy.Spider):
    name = 'crawler'
    headers = {'User-Agent':
               'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Mobile Safari/537.36'}
    start_urls = ['https://www.booking.com/hotel/it/heart-of-san-lorenzo-roma12345678910111213141516171819202122.it.html?group_adults=1;no_rooms=1;']

    def start_requests(self):
        for link in self.start_urls:
            yield scrapy.Request(url=link, headers=self.headers, callback=self.parse)

    def parse(self, response):
        coordinate = ""
        print('===========================================================')
        coordinate = response.xpath('//*[@id="hotel_address"]/@data-atlas-latlng/text()').get()
        print(coordinate)
        print('===========================================================')


process = CrawlerProcess()
process.crawl(CrawlerSpider)
process.start()
But it returns None. What am I doing wrong?
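The cause is the trailing text(). The coordinates are stored as the value of the data-atlas-latlng attribute, not as a child text node. In XPath an attribute node has no children, so /@data-atlas-latlng/text() matches nothing and .get() returns None. To fix it, end the XPath expression at @data-atlas-latlng so that it selects the attribute value directly.
Here is the updated code.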
import scrapy
from scrapy.crawler import CrawlerProcess


class CrawlerSpider(scrapy.Spider):
    name = 'crawler'
    headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Mobile Safari/537.36'}
    start_urls = ['https://www.booking.com/hotel/it/heart-of-san-lorenzo-roma12345678910111213141516171819202122.it.html?group_adults=1;no_rooms=1;']

    def start_requests(self):
        for link in self.start_urls:
            yield scrapy.Request(url=link, headers=self.headers, callback=self.parse)

    def parse(self, response):
        print('===========================================================')
        # Extract the coordinates from the data-atlas-latlng attribute
        coordinate = response.xpath('//*[@id="hotel_address"]/@data-atlas-latlng').get()
        print(f"Coordinates: {coordinate}")  # Prints the coordinates, or None if the attribute is absent
        print('===========================================================')


process = CrawlerProcess()
process.crawl(CrawlerSpider)
process.start()
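
Once the attribute is extracted, you will probably want numbers rather than a raw string. The sketch below assumes the attribute value is a single "latitude,longitude" pair separated by a comma; that format is an assumption, not something confirmed above, so check it against the actual page. The helper name parse_latlng is hypothetical and only for illustration.

```python
from typing import Optional, Tuple


def parse_latlng(raw: Optional[str]) -> Optional[Tuple[float, float]]:
    """Turn a 'lat,lng' string into a (lat, lng) tuple of floats.

    Assumes a comma-separated pair (hypothetical format). Returns None
    when the value is missing or does not match that format.
    """
    if not raw:
        # .get() returns None when the XPath matched nothing
        return None
    parts = raw.split(",")
    if len(parts) != 2:
        return None
    try:
        # float() tolerates surrounding whitespace
        lat, lng = float(parts[0]), float(parts[1])
    except ValueError:
        return None
    return lat, lng
```

Inside parse you could then call parse_latlng(coordinate) and handle a None result, which also covers the case where the page layout changes and the attribute is no longer present.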