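Thanks for taking a look at this question. I really need some help...
I used to scrape fixture results from www.britishhorseracing.com, from pages like this one:
https://www.britishhorseracing.com/racing/results/fixture-results/#!/2024/702
What I want is the distance changes listed for each race and, further down the page, the "Going" section as well.
With the help of a kind user, I managed to find and scrape the API, which gives me a results page like this: https://api09.horseracing.software/bha/v1/fixtures/2024/702
I used a simple Scrapy script to loop over the page numbers I need, after checking in the browser beforehand that they cover the right dates: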
import json

import scrapy


class ApicrawlerSpider(scrapy.Spider):
    name = 'apicrawler'
    allowed_domains = ['www.britishhorseracing.com']
    urls = 'https://api09.horseracing.software/bha/v1/fixtures?page='

    # Build one start URL per fixtures page (4214 to 4220 inclusive).
    start_urls = []
    page_number1 = 4214
    while page_number1 <= 4220:
        page_number = str(page_number1)
        start_urls.append(urls + page_number)
        page_number1 += 1
    print(start_urls)

    def parse(self, response):
        data = json.loads(response.body)
        print(data)
        data2 = data['data']
        for data3 in data2:
            fixtureID = data3['fixtureId']
            fixtureYear = data3['fixtureYear']
            base_url = 'https://api09.horseracing.software/bha/v1/fixtures'
            races_url = f'{base_url}/{fixtureYear}/{fixtureID}/races'
            #going_url = f'{base_url}/{fixtureYear}/{fixtureID}/going'
            yield {
                'url': races_url
            }
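The problem now is this: they updated their website, and all I get is a blank page with a logo, no matter which API endpoint I try to scrape or open.
I can still see the different API calls being fetched, but I can no longer access them directly.
I assume there is some kind of protection in place, but this goes well beyond my Scrapy skills and I can't see what is happening there. Some kind of cookie or handshake, perhaps?
If anyone could point me in the right direction, I would be very grateful.
P.S. I also tried scraping the web pages themselves by looping through them, but I couldn't get past the #! in the URL...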
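Based on how the target API behaves, the value you have in
allowed_domains
, www.britishhorseracing.com, is not a valid origin. The API expects
https://www.britishhorseracing.com
, scheme included, in the Origin request header. So instead of
Origin: www.britishhorseracing.com
send
Origin: https://www.britishhorseracing.com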
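If you want to see for yourself which Origin value the API accepts, a small probe along these lines may help. It is only a sketch: it uses the standard library instead of Scrapy, the helper names (build_request, looks_like_json, probe, main) are made up for illustration, and it assumes that a blocked request comes back either as an HTTP error or as a non-JSON body, which matches the "blank logo page" you describe but is not something I have verified.

```python
import json
import urllib.error
import urllib.request

API_URL = "https://api09.horseracing.software/bha/v1/fixtures?page=4214"
USER_AGENT = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) "
    "Gecko/20100101 Firefox/130.0"
)


def build_request(url, origin):
    """Build a GET request that carries the given Origin header (or none)."""
    headers = {"User-Agent": USER_AGENT}
    if origin is not None:
        headers["Origin"] = origin
    return urllib.request.Request(url, headers=headers)


def looks_like_json(body):
    """Return True if the response body parses as JSON."""
    try:
        json.loads(body)
    except ValueError:
        return False
    return True


def probe(url, origins, opener=urllib.request.urlopen):
    """Return {origin: (HTTP status, body_is_json)} for each candidate Origin."""
    results = {}
    for origin in origins:
        request = build_request(url, origin)
        try:
            with opener(request, timeout=10) as response:
                results[origin] = (response.status, looks_like_json(response.read()))
        except urllib.error.HTTPError as error:
            # 4xx/5xx responses are raised as exceptions; record them as well.
            results[origin] = (error.code, looks_like_json(error.read()))
    return results


def main():
    """Hypothetical entry point: compare no Origin, bare host, and full origin."""
    candidates = [
        None,
        "www.britishhorseracing.com",
        "https://www.britishhorseracing.com",
    ]
    for origin, (status, is_json) in probe(API_URL, candidates).items():
        print(f"Origin={origin!r}: HTTP {status}, JSON body: {is_json}")
```

Calling main() prints one line per candidate, so you can compare the three cases side by side. With the header format settled, the fixtures pages can be fetched as in the script that follows.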
import requests

# Send the full site origin (scheme included) in the Origin header.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0",
    "Origin": "https://www.britishhorseracing.com"
}

# Pages 4214 to 4219; use range(4214, 4221) to include page 4220 as well.
for page in range(4214, 4220):
    url = f"https://api09.horseracing.software/bha/v1/fixtures?page={page}"
    resp = requests.get(url, headers=headers).json()
    for fixture in resp['data']:
        races_url = f"https://api09.horseracing.software/bha/v1/fixtures/{fixture['fixtureYear']}/{fixture['fixtureId']}/races"
        print(races_url)
Output:
https://api09.horseracing.software/bha/v1/fixtures/2024/11672/races
https://api09.horseracing.software/bha/v1/fixtures/2024/1430/races
https://api09.horseracing.software/bha/v1/fixtures/2024/21901/races
https://api09.horseracing.software/bha/v1/fixtures/2024/861/races
https://api09.horseracing.software/bha/v1/fixtures/2024/1266/races
https://api09.horseracing.software/bha/v1/fixtures/2024/1739/races
https://api09.horseracing.software/bha/v1/fixtures/2024/16065/races
https://api09.horseracing.software/bha/v1/fixtures/2024/504/races
https://api09.horseracing.software/bha/v1/fixtures/2024/455/races
https://api09.horseracing.software/bha/v1/fixtures/2024/1452/races
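Since the goal was the distance changes and the going for each fixture, the next step could look roughly like the sketch below. The helper names (fixture_endpoints, fetch_json, collect_fixture) are made up for illustration. The /races path is the one used above, while the /going path is taken from the commented-out line in the question and is untested here. Nothing is assumed about the fields inside the responses; the JSON is returned as-is.

```python
import json
import urllib.request

BASE_URL = "https://api09.horseracing.software/bha/v1/fixtures"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) "
        "Gecko/20100101 Firefox/130.0"
    ),
    # Full site origin, scheme included.
    "Origin": "https://www.britishhorseracing.com",
}


def fixture_endpoints(fixture):
    """Build the per-fixture URLs from one entry of the fixtures 'data' list."""
    root = f"{BASE_URL}/{fixture['fixtureYear']}/{fixture['fixtureId']}"
    return {
        "races": f"{root}/races",
        # Assumption: path copied from the commented-out going_url in the question.
        "going": f"{root}/going",
    }


def fetch_json(url, opener=urllib.request.urlopen):
    """GET a URL with the headers above and decode the body as JSON."""
    request = urllib.request.Request(url, headers=HEADERS)
    with opener(request, timeout=10) as response:
        return json.loads(response.read())


def collect_fixture(fixture, opener=urllib.request.urlopen):
    """Fetch every endpoint of one fixture; returns {'races': ..., 'going': ...}."""
    return {
        name: fetch_json(url, opener)
        for name, url in fixture_endpoints(fixture).items()
    }
```

For example, collect_fixture({"fixtureYear": 2024, "fixtureId": 702}) would request both URLs for the fixture from the question. If you would rather stay in Scrapy, the same two headers can be passed through the headers argument of scrapy.Request or the DEFAULT_REQUEST_HEADERS setting.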