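I am building a script to scrape the hltv.org/results page for CS2 matches, but I am running into a lot of problems with it. Specifically, each page at hltv.org/results?offset={} contains multiple divs with the class results-sublist, and the individual results are stored inside them in divs with the class result-con. My script only scrapes the first results-sublist and gets no further, so a large number of matches are missing.
My function is defined as: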
import urllib.request
from urllib.error import URLError, HTTPError
from urllib.request import Request, urlopen  # Import Request and urlopen
import bs4
import time
import pandas as pd
from tqdm import tqdm
import random

# Base URL for HLTV results with pagination
base_url = 'https://www.hltv.org/results?offset={}'

def scrape_match_links(base_url, num_pages):
    """
    Scrapes match links from the HLTV results pages.

    Parameters:
        base_url (str): The base URL for the HLTV results page.
        num_pages (int): The number of pages to scrape (should be 100 results per page).

    Returns:
        list: A list of match links.
    """
    offset = 0
    match_links = []
    headers = {'User-Agent': 'Mozilla/5.0'}
    # Use < rather than <= so that exactly num_pages pages are requested
    while offset < num_pages * 100:
        url = base_url.format(offset)
        req = urllib.request.Request(url, headers=headers)
        try:
            with urllib.request.urlopen(req) as response:
                webpage = response.read().decode('utf-8')
            soup = bs4.BeautifulSoup(webpage, 'html.parser')
            # Find all results-sublist divs
            results_sublists = soup.find_all('div', class_='results-sublist')
            if not results_sublists:
                print("No results-sublist found on this page.")
                break  # Stop if no results are found on this page
            for sublist in results_sublists:
                results = sublist.find_all('div', class_='result-con')
                for result in results:
                    match_link = result.find('a')
                    if match_link:
                        full_link = 'https://www.hltv.org' + match_link['href']
                        match_links.append(full_link)  # Append full URL
            offset += 100
        except HTTPError as e:
            print(f'HTTPError: {e.code} - {e.reason}')
            break  # Stop if there's an HTTP error (e.g., too many requests)
        except URLError as e:
            print(f'URLError: {e.reason}')
            break
        except Exception as e:
            print(f'Unexpected error: {str(e)}')
            break
    return match_links
Any insight into why I am running into this problem, and/or ideas for solving it, would be greatly appreciated. Best, Jonas
I tried using Selenium. It works on the first page, where it finds 6 divs with the class results-sublist, but on subsequent pages it returns 0 instances of results-sublist.
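Hi, your target site appears to be protected by Cloudflare, so to run this script you need to get past the bot-detection mechanism by adding user-agent and cookie values to the request. Here is a code example that produces the desired output: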
import requests
from bs4 import BeautifulSoup

header = {
    # your cookie and user-agent headers go here
}

# Reuse one session for all requests
session = requests.Session()
for offset in range(100, 1000, 100):
    url = f"https://www.hltv.org/results?offset={offset}"
    resp = session.get(url, headers=header).text
    soup = BeautifulSoup(resp, 'lxml')
    # Print every link whose href points at a match page
    for link in soup.find_all('a', href=True):
        if 'matches/' in link['href']:
            print(f"https://www.hltv.org{link['href']}")
https://www.hltv.org/matches/2376628/tsm-impact-vs-fluffy-mafia-esl-impact-league-season-6-north-america
https://www.hltv.org/matches/2376630/blue-otter-karma-vs-nouns-fe-esl-impact-league-season-6-north-america
https://www.hltv.org/matches/2376629/flyquest-red-vs-aware-fe-esl-impact-league-season-6-north-america
https://www.hltv.org/matches/2376631/lotus-fe-vs-imp-pact-fe-esl-impact-league-season-6-north-america
https://www.hltv.org/matches/2376796/bestia-vs-red-canids-esl-challenger-league-season-48-south-america
https://www.hltv.org/matches/2376624/fluxo-demons-vs-furia-fe-esl-impact-league-season-6-south-america
https://www.hltv.org/matches/2376626/atrix-vs-thekillaz-fe-esl-impact-league-season-6-south-america
https://www.hltv.org/matches/2376625/mibr-fe-vs-insanity-fe-esl-impact-league-season-6-south-america
https://www.hltv.org/matches/2376627/capivaras-vs-peak-fe-esl-impact-league-season-6-south-america
https://www.hltv.org/matches/2376615/big-equipa-vs-hsg-fe-esl-impact-league-season-6-europe
https://www.hltv.org/matches/2376754/3dmax-vs-saw-esl-challenger-league-season-48-europe
https://www.hltv.org/matches/2376617/let-her-cook-vs-nip-impact-esl-impact-league-season-6-europe
https://www.hltv.org/matches/2376616/dream-catchers-fe-vs-permitta-w-esl-impact-league-season-6-europe
https://www.hltv.org/matches/2376619/spirit-fe-vs-ence-athena-esl-impact-league-season-6-europe
https://www.hltv.org/matches/2376618/navi-javelins-vs-astralis-w-esl-impact-league-season-6-europe
https://www.hltv.org/matches/2376614/imperial-fe-vs-crescent-fe-esl-impact-league-season-6-europe
https://www.hltv.org/matches/2376739/ence-vs-9z-elisa-masters-espoo-2024
https://www.hltv.org/matches/2376814/unpaid-vs-parivision-res-regional-champions-2024
https://www.hltv.org/matches/2376725/aurora-young-blud-vs-cph-wolves-winline-insight-season-6
https://www.hltv.org/matches/2376738/b8-vs-jano-elisa-masters-espoo-2024
https://www.hltv.org/matches/2376766/bromo-vs-ihc-esl-challenger-league-season-48-asia-pacific
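The substring check on 'matches/' accepts any href that happens to contain that text, and it would print a match again if the page links to it more than once. A minimal sketch of a stricter filter follows. It assumes match hrefs have the shape /matches/<numeric id>/<slug>, which is inferred only from the output above. The helper name unique_match_links and the third entry of the sample list are hypothetical.

```python
import re

# Assumed href shape, inferred from the printed output: /matches/<numeric id>/<slug>
MATCH_HREF = re.compile(r"^/matches/(\d+)/[^/?#]+")

def unique_match_links(hrefs, base="https://www.hltv.org"):
    """Return absolute match URLs in first-seen order, one per match id."""
    seen = set()
    links = []
    for href in hrefs:
        m = MATCH_HREF.match(href)
        if m is None:
            continue  # not shaped like a match page link
        match_id = m.group(1)
        if match_id in seen:
            continue  # this match was already collected
        seen.add(match_id)
        links.append(base + href)
    return links

# Sample hrefs: two taken from the output above (one repeated),
# plus one hypothetical non-match link that still contains "matches/".
sample = [
    "/matches/2376739/ence-vs-9z-elisa-masters-espoo-2024",
    "/matches/2376739/ence-vs-9z-elisa-masters-espoo-2024",
    "/some-section/matches/overview",
    "/matches/2376738/b8-vs-jano-elisa-masters-espoo-2024",
]
print(unique_match_links(sample))
```

In the loop above, this could be fed with the href values collected from soup.find_all('a', href=True) instead of printing each one directly.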
Please let me know if this works for you.
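If a page still comes back with no results, it can help to check whether the response is a bot-challenge page rather than a real results page. The sketch below is only a heuristic. It assumes that a challenge response carries a cf-mitigated: challenge header or a "Just a moment" page body together with a 403, 429, or 503 status. These markers are assumptions about how Cloudflare labels such responses and may change, and the function name looks_like_challenge is hypothetical.

```python
def looks_like_challenge(status, headers, body):
    """Heuristically guess whether a response is a bot challenge page."""
    # Normalise header names and values so the lookup is case-insensitive
    lowered = {str(k).lower(): str(v).lower() for k, v in headers.items()}
    # Assumed marker: header set on challenged responses
    if lowered.get("cf-mitigated") == "challenge":
        return True
    # Assumed marker: interstitial page text combined with an error status
    if status in (403, 429, 503) and "just a moment" in body.lower():
        return True
    return False

# Quick check with made-up responses (no network access needed)
print(looks_like_challenge(403, {"CF-Mitigated": "challenge"}, ""))
print(looks_like_challenge(200, {}, '<div class="results-sublist"></div>'))
```

With requests, it could be called as looks_like_challenge(resp.status_code, resp.headers, resp.text) on the response object before the body is parsed.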