The HTML shown in Inspect Element differs from what is displayed on screen

Question · Votes: 0 · Answers: 1

I am trying to scrape data from this site: https://www.eurobasket.com/Basketball-Box-Score.aspx?Game=2009_1211_2563_2684-Lebanon

The page contains two tables, but the data displayed in the table rows differs from the data in the HTML source (as seen with Inspect Element).

For example, here is the first row's markup:

<tr class="my_pStats1" onmouseover="this.style.backgroundColor='#C3C3C3';" onmouseout="this.style.backgroundColor='#FFFFFF';" valign="center" height="17" style="background-color: rgb(255, 255, 255);">
<td class="headcol">&nbsp;</td>
<td class="headcol2 my_playerName" align="left"><a class="my_playerB" href="https://basketball.asia-basket.com/player/Jean-Abdel-Nour/45278"><font color="#0066cc">SMdRl-XIuQ, zRij</font></a></td>
<td>45</td>
<td>4-9 (38.7%)</td>
<td>0-9 (96.3%)</td>
<td>5-5 (5%)</td>
<td class="hiddensmall">5</td>
<td class="hiddensmall">6</td>
<td>6</td>
<td>1</td>
<td>6</td>
<td class="hiddensmall">5</td>
<td class="hiddensmall">5</td>
<td class="hiddensmall">6</td>
<td class="hiddensmall">5</td>
<td class="hiddensmall">8</td>
<td>86</td>
<td class="hiddensmall">5</td>
<td class="hiddensmall">5</td></tr>

But the player's name shown on screen is

Jean Abdel-Nour
rather than
SMdRl-XIuQ, zRij
and the numbers are similarly scrambled.

I tried Selenium, but it did not help:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

def extract_box_score_from_url(url):
    # Fetch the webpage content
    driver = webdriver.Chrome()  # Ensure ChromeDriver is installed and in PATH
    driver.get(url)
    html_content = driver.page_source
    soup = BeautifulSoup(html_content, 'html.parser')
    driver.quit()
    
    # Extract team and opponent names
    team = soup.find('table', {'id': 'aannew'}).find('a').text.strip()
    opponent = soup.find_all('table', {'id': 'aannew'})[1].find('a').text.strip()

    # Extract headers
    stats_divs = soup.find_all('div', class_='dvbs')
    header_rows = stats_divs[0].find('thead').find_all('tr')
    
    # Build the flat header list from the second header row,
    # which holds the per-column labels
    headers = []
    for th in header_rows[1].find_all('th'):
        headers.append(th.get_text(strip=True))

    # Add Team and Opponent columns to headers
    headers += ['Team', 'Opponent']

    # Function to extract stats table for a team
    def extract_team_stats(dvbs):
        rows = dvbs.find('tbody').find_all('tr', class_=['my_pStats1', 'my_pStats2'])
        stats = []
        for row in rows:
            cols = row.find_all('td')
            player_data = [col.get_text(strip=True) for col in cols]
            stats.append(player_data)
        return stats

    # Extract stats for both teams
    team_stats = extract_team_stats(stats_divs[0])
    opponent_stats = extract_team_stats(stats_divs[1])

    # Add Team and Opponent columns
    num_columns = len(headers)
    team_stats = [row + [team, opponent] for row in team_stats if len(row) + 2 == num_columns]
    opponent_stats = [row + [opponent, team] for row in opponent_stats if len(row) + 2 == num_columns]

    # Combine data
    combined_stats = team_stats + opponent_stats

    # Create dataframe
    df = pd.DataFrame(combined_stats, columns=headers)

    return df

url = "https://www.eurobasket.com/Basketball-Box-Score.aspx?Game=2009_1211_2563_2684-Lebanon"
df = extract_box_score_from_url(url)

df

Can you help me find a way to scrape this data? The Selenium code above still returns the obfuscated values.

python html web-scraping
1 Answer
Votes: 0

The page you are scraping uses a customized version of the Arial font that swaps glyphs into different positions. The effect is that the server can deliver incorrect data in the HTML, yet the page still displays correctly on screen when rendered in that font.

They almost certainly do this because they do not want their data scraped. My best advice is to review their terms of service before going any further.
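For what it's worth, font obfuscation of this kind is a fixed glyph substitution, so if you did recover the mapping (for example by comparing the custom font's glyph outlines against stock Arial using a tool such as fontTools), repairing the scraped text is a simple character translation. A minimal sketch with an entirely hypothetical mapping; the real table would have to be extracted from the site's font file:

```python
# Hypothetical glyph-swap table. The real mapping must be recovered from
# the site's custom font file; these four pairs are made up for illustration.
GLYPH_MAP = str.maketrans({"S": "J", "M": "e", "d": "a", "R": "n"})

def deobfuscate(text: str) -> str:
    """Translate obfuscated source-HTML characters back to the true ones.

    Characters absent from the table pass through unchanged.
    """
    return text.translate(GLYPH_MAP)

print(deobfuscate("SMdR"))  # with the made-up table above -> "Jean"
```

You would apply `deobfuscate` to every cell extracted by `get_text` before building the DataFrame. Note that the mapping may differ per page load if the site rotates fonts, which is another reason to check the terms of service first.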
