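The code below finds most of the elements I'm looking for. However, the labels for temperature and wind speed change depending on the severity of the weather. How can I get the code below to consistently pick up the correct TempProb and wind speed values on the page?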
import scrapy

class NflweatherdataSpider(scrapy.Spider):
    name = 'NFLWeatherData'
    allowed_domains = ['nflweather.com']
    start_urls = ['http://nflweather.com/']

    def parse(self, response):
        # Extracting the content using XPath selectors
        Datetimes = response.xpath(
            '//div[@class="fw-bold text-wrap"]/text()').extract()
        awayTeams = response.xpath('//span[@class="fw-bold"]/text()').extract()
        homeTeams = response.xpath(
            '//span[@class="fw-bold ms-1"]/text()').extract()
        TempProbs = response.xpath(
            '//div[@class="mx-2"]/span/text()').extract()
        windspeeds = response.xpath(
            '//div[@class="text-break col-md-4 mb-1 px-1 flex-centered"]/span/text()').extract()
        # winddirection =
        # Give the extracted content row wise
        for item in zip(Datetimes, awayTeams, homeTeams, TempProbs, windspeeds):
            # create a dictionary to store the scraped info
            scraped_info = {
                'Datetime': item[0],
                'awayTeam': item[1],
                'homeTeam': item[2],
                'TempProb': item[3],
                'windspeeds': item[4]
            }
            # yield or give the scraped info to scrapy
            yield scraped_info
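Sure! Here is the modified Scrapy code. I made some changes so that the temperature, probability, and wind speed are extracted more consistently, and I added comments explaining each part of the code: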
import scrapy

class NflweatherdataSpider(scrapy.Spider):
    name = 'NFLWeatherData'
    allowed_domains = ['nflweather.com']
    start_urls = ['http://nflweather.com/']

    def parse(self, response):
        # Extracting the content using css selectors
        game_boxes = response.css('div.game-box')
        for game_box in game_boxes:
            # Extracting date and time information
            Datetimes = game_box.css('.col-12 .fw-bold::text').get()
            # Extracting team information
            team_game_boxes = game_box.css('.team-game-box')
            awayTeams = team_game_boxes.css('.fw-bold::text').get()
            homeTeams = team_game_boxes.css('.fw-bold.ms-1::text').get()
            # Extracting temperature and probability information
            TempProbs = game_box.css('.col-md-4 .mx-2 span::text').get()
            # Extracting wind speed information
            windspeeds = game_box.css('.col-md-4.mb-1 .text-danger::text').get()
            # Create a dictionary to store the scraped info
            scraped_info = {
                'Datetime': Datetimes,
                'awayTeam': awayTeams,
                'homeTeam': homeTeams,
                'TempProb': TempProbs,
                'windspeeds': windspeeds
            }
            # Yield or give the scraped info to Scrapy
            yield scraped_info
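I made the team selectors more specific. Instead of matching team names across the whole page, the code looks inside the .team-game-box elements of each game box and tells the two teams apart by class: the first .fw-bold match is taken as the away team, and .fw-bold.ms-1 as the home team.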
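To see why looking things up one game box at a time matters, here is a minimal pure-Python sketch with made-up data (the team names and wind values are hypothetical, not scraped from the site). It assumes one game has no wind reading: zipping page-wide lists then shifts a later value onto the wrong game, while per-game records keep the gap as None.

```python
# Hypothetical data for three games; the second game has no wind reading.
games = [
    {"away": "Team A", "wind": "12 mph"},
    {"away": "Team B", "wind": None},
    {"away": "Team C", "wind": "20 mph"},
]

# Page-wide extraction returns only the values that exist, so the wind list
# ends up shorter than the team list.
away_teams = [g["away"] for g in games]
winds = [g["wind"] for g in games if g["wind"] is not None]

# zip() pairs by position and stops at the shortest list: Team B picks up
# Team C's wind value, and Team C is dropped entirely.
zipped_rows = list(zip(away_teams, winds))

# Per-game extraction keeps each value attached to its own game.
per_game_rows = [(g["away"], g["wind"]) for g in games]

print(zipped_rows)
print(per_game_rows)
```

This is the same reason the spider above loops over each game box and calls .get() inside it rather than zipping lists built from the whole response.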
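For temperature and probability, I kept the selector essentially as it was, on the assumption that it still works with your updated HTML snippet. If the page structure changes, you may need to adjust this selector.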
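For wind speed, I changed the selector to target the span with the "text-danger" class inside the relevant div. This should make the extraction more consistent.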
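Since the question says the labels change with weather severity, a selector tied to a single severity class such as "text-danger" may miss milder readings. The sketch below illustrates the alternative of matching by position in the structure rather than by severity class. It uses the standard library's xml.etree.ElementTree on hypothetical, simplified markup: the "wind" and "text-warning" classes, the helper name wind_for, and the sample values are all assumptions for illustration, not the site's real HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the span's class varies with severity,
# and the last game has no wind reading at all.
SAMPLE = """
<root>
  <div class="game-box"><div class="wind"><span class="text-danger">22 mph</span></div></div>
  <div class="game-box"><div class="wind"><span class="text-warning">14 mph</span></div></div>
  <div class="game-box"><div class="wind"><span>5 mph</span></div></div>
  <div class="game-box"><div class="wind"></div></div>
</root>
"""

root = ET.fromstring(SAMPLE)


def wind_for(game_box, severity_class=None):
    """Return the wind text for one game box, or None if no span matches."""
    if severity_class is None:
        # Match any span inside the wind container, whatever its class.
        path = "./div[@class='wind']/span"
    else:
        # Match only spans carrying one specific severity class.
        path = f"./div[@class='wind']/span[@class='{severity_class}']"
    span = game_box.find(path)
    return span.text if span is not None else None


boxes = root.findall("./div[@class='game-box']")
by_class = [wind_for(box, "text-danger") for box in boxes]
by_structure = [wind_for(box) for box in boxes]

print(by_class)
print(by_structure)
```

In the spider, the equivalent change would be to drop the severity class from the wind selector and select the span inside the wind container of each game box, after checking the page's actual HTML to confirm which container holds the wind value.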