HLTV results scraper not working: multiple divs with the same class name

Problem description (votes: 0, answers: 1)

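I'm building a script to scrape the hltv.org/results page for CS2 matches. However, I've run into a lot of problems with it. Specifically, the page hltv.org/results?offset={} contains several divs with the class results-sublist, and the results are stored inside them in divs with the class result-con. My script scrapes only the first results-sublist and gets no further, so a large number of matches are missing.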

My function is defined as follows:

import urllib.request
from urllib.error import URLError, HTTPError
import bs4
import time
import pandas as pd
from tqdm import tqdm
import random

# Base URL for HLTV results with pagination
base_url = 'https://www.hltv.org/results?offset={}'

def scrape_match_links(base_url, num_pages):
    """
    Scrapes match links from the HLTV results pages.

    Parameters:
    base_url (str): The base URL for the HLTV results page.
    num_pages (int): The number of pages to scrape (each page holds 100 results).

    Returns:
    list: A list of match links.
    """
    offset = 0
    match_links = []
    headers = {'User-Agent': 'Mozilla/5.0'}

    while offset < num_pages * 100:  # offsets 0, 100, ..., (num_pages - 1) * 100
        url = base_url.format(offset)
        req = urllib.request.Request(url, headers=headers)
        
        try:
            with urllib.request.urlopen(req) as response:
                webpage = response.read().decode('utf-8')
                soup = bs4.BeautifulSoup(webpage, 'html.parser')
                
                # Find all results-sublist divs
                results_sublists = soup.find_all('div', class_='results-sublist')
                if not results_sublists:
                    print("No results-sublist found on this page.")
                    break  # Stop if no results are found on this page

                for sublist in results_sublists:
                    results = sublist.find_all('div', class_='result-con')
                    for result in results:
                        match_link = result.find('a')
                        if match_link:
                            full_link = 'https://www.hltv.org' + match_link['href']
                            match_links.append(full_link)  # Append full URL

            offset += 100
            
        except HTTPError as e:
            print(f'HTTPError: {e.code} - {e.reason}')
            break  # Stop if there's an HTTP error (e.g., too many requests)
        except URLError as e:
            print(f'URLError: {e.reason}')
            break
        except Exception as e:
            print(f'Unexpected error: {str(e)}')
            break

    return match_links

Any insight into why I'm running into this problem, and/or ideas for solving it, would be greatly appreciated.
Best, Jonas

I also tried Selenium. It works on the first page, where it finds 6 divs with the class results-sublist, but on subsequent pages it returns 0 instances of results-sublist.
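
One way to narrow this down is to run the same nested find_all logic against a static snippet, with no network involved. The sketch below is only a minimal check: SAMPLE_HTML is hand-written markup that mimics the structure described above (it is not copied from the real page), and extract_match_links is a hypothetical helper name. If every sublist is found here, the parsing itself is probably fine and the HTML actually returned by the site is the thing to inspect.

```python
import bs4

# Hypothetical, hand-written HTML that mimics the structure described above;
# it is not a copy of the real HLTV markup.
SAMPLE_HTML = """
<div class="results-sublist">
  <div class="result-con"><a href="/matches/1/team-a-vs-team-b">A vs B</a></div>
  <div class="result-con"><a href="/matches/2/team-c-vs-team-d">C vs D</a></div>
</div>
<div class="results-sublist">
  <div class="result-con"><a href="/matches/3/team-e-vs-team-f">E vs F</a></div>
</div>
"""

def extract_match_links(html):
    """Return the full URL of the first link in every result-con div of every results-sublist."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    links = []
    for sublist in soup.find_all('div', class_='results-sublist'):
        for result in sublist.find_all('div', class_='result-con'):
            anchor = result.find('a', href=True)
            if anchor:
                links.append('https://www.hltv.org' + anchor['href'])
    return links

links = extract_match_links(SAMPLE_HTML)
print(len(links))
for link in links:
    print(link)
```

Passing the decoded response body of a failing page to the same helper and getting an empty list would suggest that the server sent different HTML, rather than that the loops skipped anything.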

python selenium-webdriver web-scraping beautifulsoup
1 answer
0 votes

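Hi, your target site appears to be protected by Cloudflare, so to run this script you have to get past its bot-detection mechanism by adding user-agent and cookie values to the request. Here is a code sample that produces the desired output: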

Code:

import requests
from bs4 import BeautifulSoup

headers = {
    # your Cookie and User-Agent headers go here
}
# Create the session once so it is reused for every page
session = requests.Session()
for offset in range(100, 1000, 100):
    url = f"https://www.hltv.org/results?offset={offset}"
    resp = session.get(url, headers=headers).text
    soup = BeautifulSoup(resp, 'lxml')
    # Use a separate name for the inner loop so it does not shadow the offset
    for link in soup.find_all('a', href=True):
        if 'matches/' in link['href']:
            print(f"https://www.hltv.org{link['href']}")

Output:

https://www.hltv.org/matches/2376628/tsm-impact-vs-fluffy-mafia-esl-impact-league-season-6-north-america
https://www.hltv.org/matches/2376630/blue-otter-karma-vs-nouns-fe-esl-impact-league-season-6-north-america
https://www.hltv.org/matches/2376629/flyquest-red-vs-aware-fe-esl-impact-league-season-6-north-america
https://www.hltv.org/matches/2376631/lotus-fe-vs-imp-pact-fe-esl-impact-league-season-6-north-america
https://www.hltv.org/matches/2376796/bestia-vs-red-canids-esl-challenger-league-season-48-south-america
https://www.hltv.org/matches/2376624/fluxo-demons-vs-furia-fe-esl-impact-league-season-6-south-america
https://www.hltv.org/matches/2376626/atrix-vs-thekillaz-fe-esl-impact-league-season-6-south-america
https://www.hltv.org/matches/2376625/mibr-fe-vs-insanity-fe-esl-impact-league-season-6-south-america
https://www.hltv.org/matches/2376627/capivaras-vs-peak-fe-esl-impact-league-season-6-south-america
https://www.hltv.org/matches/2376615/big-equipa-vs-hsg-fe-esl-impact-league-season-6-europe
https://www.hltv.org/matches/2376754/3dmax-vs-saw-esl-challenger-league-season-48-europe
https://www.hltv.org/matches/2376617/let-her-cook-vs-nip-impact-esl-impact-league-season-6-europe
https://www.hltv.org/matches/2376616/dream-catchers-fe-vs-permitta-w-esl-impact-league-season-6-europe
https://www.hltv.org/matches/2376619/spirit-fe-vs-ence-athena-esl-impact-league-season-6-europe
https://www.hltv.org/matches/2376618/navi-javelins-vs-astralis-w-esl-impact-league-season-6-europe
https://www.hltv.org/matches/2376614/imperial-fe-vs-crescent-fe-esl-impact-league-season-6-europe
https://www.hltv.org/matches/2376739/ence-vs-9z-elisa-masters-espoo-2024
https://www.hltv.org/matches/2376814/unpaid-vs-parivision-res-regional-champions-2024
https://www.hltv.org/matches/2376725/aurora-young-blud-vs-cph-wolves-winline-insight-season-6
https://www.hltv.org/matches/2376738/b8-vs-jano-elisa-masters-espoo-2024
https://www.hltv.org/matches/2376766/bromo-vs-ihc-esl-challenger-league-season-48-asia-pacific
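
If the plain substring check ever picks up duplicates or links that are not match pages, a stricter filter may help. This is only a sketch: the /matches/<id>/<slug> pattern is an assumption inferred from the URLs in the output above, and clean_match_links, MATCH_HREF and BASE_URL are hypothetical names rather than anything the site or a library provides.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'https://www.hltv.org'
# Assumed pattern: match pages look like /matches/<numeric id>/<slug>,
# inferred from the URLs printed in the output above.
MATCH_HREF = re.compile(r'^/matches/\d+/[^/]+$')

def clean_match_links(hrefs):
    """Keep hrefs that look like match pages, make them absolute, drop duplicates."""
    seen = set()
    links = []
    for href in hrefs:
        if not MATCH_HREF.match(href):
            continue
        full = urljoin(BASE_URL, href)
        if full not in seen:
            seen.add(full)
            links.append(full)
    return links

hrefs = [
    '/matches/2376739/ence-vs-9z-elisa-masters-espoo-2024',
    '/matches/',  # hypothetical link that contains 'matches/' but is not a match page
    '/matches/2376739/ence-vs-9z-elisa-masters-espoo-2024',  # duplicate
    '/matches/2376738/b8-vs-jano-elisa-masters-espoo-2024',
]
cleaned = clean_match_links(hrefs)
for link in cleaned:
    print(link)
```

In the scraping loop, the hrefs would come from soup.find_all('a', href=True); collecting them into a list first and cleaning them once at the end keeps the order in which the matches were first seen.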

Please let me know if this works for you.
