Scraping data from a website with a complex structure


I am trying to scrape data from the Transfermarkt website using Python, but the site's structure is complex. I tried the requests and Beautiful Soup modules with the code below; however, the end result is two empty data frames for the "In" and "Out" transfers. I want to extract the information from the tables (as shown in the picture) into two separate data frames: in_transfers_df should contain the information shown in the "In" table, and out_transfers_df the information shown in the "Out" table. This should be repeated for each club header, e.g. Arsenal, Aston Villa.

I have attached a screenshot showing the website structure along with my code attempt. Any help would be greatly appreciated.


import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the Transfermarkt page
url = 'https://www.transfermarkt.com/premier-league/transfers/wettbewerb/GB1/plus/?saison_id=2023&s_w=&leihe=0&intern=0'

# Send a GET request to the URL
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(url, headers=headers)
response.raise_for_status()  # Raise an exception if the request was unsuccessful

# Parse the page content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Function to extract transfer data
def extract_transfer_data(table):
    transfers = []
    rows = table.find_all('tr', class_=['odd', 'even'])
    for row in rows:
        cols = row.find_all('td')
        if len(cols) >= 5:  # Ensure there are enough columns
            transfers.append({
                'Player': cols[0].text.strip(),
                'Age': cols[1].text.strip(),
                'Club': cols[2].text.strip(),
                'Fee': cols[4].text.strip()
            })
    return transfers

# Locate the main transfer table container
transfer_containers = soup.find_all('div', class_='grid-view')

# Debugging: print the number of transfer containers found
print(f"Found {len(transfer_containers)} transfer containers.")

# Extract 'In' and 'Out' transfers data
in_transfers = []
out_transfers = []

for container in transfer_containers:
    headers = container.find_all('h2')
    tables = container.find_all('table')
    for header, table in zip(headers, tables):
        if 'In' in header.text:
            in_transfers.extend(extract_transfer_data(table))
        elif 'Out' in header.text:
            out_transfers.extend(extract_transfer_data(table))

# Convert to DataFrames
in_transfers_df = pd.DataFrame(in_transfers)
out_transfers_df = pd.DataFrame(out_transfers)
1 Answer

As @GTK correctly pointed out, your markup is out of date. If you look closely, the data you need is now located in div elements with the class "box"; you need to hook into those to retrieve it.

Be careful, though: not every element with that class holds transfer data. Other blocks on the page have a similar structure, so you have to filter them out.
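To illustrate the filtering idea, here is a minimal sketch on toy HTML (the markup and club name are invented for the example, but mirror the structure described above): only boxes whose first link carries a title attribute are treated as club boxes.

```python
from bs4 import BeautifulSoup

# Toy HTML: the first "box" is a club box with a titled link,
# the second is an unrelated block with the same class.
html = """
<div class="box"><h2><a title="Arsenal FC" href="#">Arsenal FC</a></h2></div>
<div class="box"><h2>Some other section</h2></div>
"""

soup = BeautifulSoup(html, "html.parser")
club_boxes = [
    box for box in soup.find_all("div", class_="box")
    if box.find("a") is not None and box.find("a").has_attr("title")
]
print([box.find("a")["title"] for box in club_boxes])  # ['Arsenal FC']
```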

With that in mind, here is a solution I sketched out on the fly. You should work through it step by step and understand how it works, improve the error handling, and, if necessary, load the data into pandas.

from collections import defaultdict
from pprint import pprint

import requests
from bs4 import BeautifulSoup

start_url = (
    'https://www.transfermarkt.com/premier-league/transfers/wettbewerb/GB1/'
    'plus/?saison_id=2023&s_w=&leihe=0&intern=0'
)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(start_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')


def extract_club_name(node):
    # A club box carries the club name in the title attribute of its first
    # link; boxes without such a link are not club sections.
    try:
        return node.find('a')['title']
    except (TypeError, KeyError):
        return None


def parse_transfers_table(node):
    for tr in node.find('tbody').find_all('tr'):
        # Nationality flags are <img> tags inside this cell; iterate the
        # images rather than the <td> itself, which would also yield
        # whitespace text nodes.
        national = tr.find('td', class_='nat-transfer-cell')
        prev_club_data = tr.find(
            'td',
            class_='no-border-links verein-flagge-transfer-cell',
        )
        previous_club = (
            '' if prev_club_data.find('a') is None
            else prev_club_data.find('a')['title']
        )

        yield {
            'name': tr.find('span').find('a')['title'],
            'age': tr.find('td', class_='alter-transfer-cell').text,
            'national': [
                img['title']
                for img in national.find_all('img')
                if img.has_attr('title')
            ],
            'position': tr.find('td', class_='kurzpos-transfer-cell').text,
            'market_price': tr.find('td', class_='mw-transfer-cell').text,
            'previous_club': previous_club,
            'transfer_value': tr.find('td', class_='rechts').text,
        }


# Maps club name -> {'in': [...], 'out': [...]}
result = defaultdict(dict)
for club_info in soup.find_all('div', class_='box'):
    club_name = extract_club_name(club_info)
    if club_name is None:
        # Skip boxes that are not club sections.
        continue

    # Each club box holds exactly two responsive tables: arrivals, departures.
    in_transfers_table, out_transfers_table = (
        club_info.find_all('div', class_='responsive-table')
    )
    result[club_name]['in'] = [*parse_transfers_table(in_transfers_table)]
    result[club_name]['out'] = [*parse_transfers_table(out_transfers_table)]

pprint(result)
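If you do want the in_transfers_df / out_transfers_df layout from the question, the nested result dict can be flattened into two pandas DataFrames. A minimal sketch (the sample data below is invented; it only mirrors the {club: {'in': [...], 'out': [...]}} shape produced above):

```python
import pandas as pd

# Hypothetical sample with the same shape as `result` built above.
result = {
    'Arsenal FC': {
        'in':  [{'name': 'Player A', 'age': '24', 'transfer_value': '€10m'}],
        'out': [{'name': 'Player B', 'age': '29', 'transfer_value': '€5m'}],
    },
}

def flatten(result, direction):
    # Tag each row with its club so the grouping survives in a flat table.
    rows = []
    for club, transfers in result.items():
        for row in transfers.get(direction, []):
            rows.append({'club': club, **row})
    return pd.DataFrame(rows)

in_transfers_df = flatten(result, 'in')
out_transfers_df = flatten(result, 'out')
print(in_transfers_df)
print(out_transfers_df)
```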