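I want to scrape a football website to build a dataset in Pandas. I don't know how to get the scraped player information into 3 columns (name, league, team) and add the country so that it fits a table/DataFrame.
The information has been scraped, although not very neatly, but I'm not sure whether (or how) I should create an array and loop the information into a list or array.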
from bs4 import BeautifulSoup
import requests
url = 'https://ng.soccerway.com/players/players_abroad/nigeria/'
req = requests.get(url,headers={'User-Agent':'Mozilla/5.0'})
page = req
soup = BeautifulSoup(page.text, 'html')
table = soup.find_all('table', class_="playersabroad table")
player_country = soup.find_all('th')
player_country_header = [country.text.strip() for country in player_country]
print(player_country_header)
import pandas as pd
import numpy as np
df = pd.DataFrame(columns = ['player-name', 'League', 'team_name'])
#df = pd.DataFrame(columns = player_country_header ) df
table_data = soup.find_all('td')
player_data_list=[data.text.strip() for data in table_data]
#length = len(df)
#df.loc[length] = player_data_list
print(player_data_list)
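One way to stay with BeautifulSoup is to walk the table row by row instead of collecting every td at once, remembering the most recent country header while looping. The sketch below is only an illustration: SAMPLE_HTML is hypothetical markup (a th row per country, then player rows with three td cells) and rows_to_frame is a made-up helper name, so the selectors will likely need adjusting to the real page.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup standing in for the real page: a country header row
# (<th>) followed by player rows with three <td> cells each.
SAMPLE_HTML = """
<table class="playersabroad table">
  <tr><th colspan="3">England</th></tr>
  <tr><td>A. Iwobi</td><td>Premier League</td><td>Fulham</td></tr>
  <tr><td>T. Awoniyi</td><td>Premier League</td><td>Nottingham Forest</td></tr>
  <tr><th colspan="3">Yemen</th></tr>
  <tr><td>A. Adisa</td><td>Yemeni League</td><td>Al Urooba</td></tr>
</table>
"""


def rows_to_frame(html):
    """Build one record per player row, tagging it with the last country seen."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    country = None
    for tr in soup.find_all("tr"):
        header = tr.find("th")
        if header is not None:
            # A header row starts a new country block.
            country = header.get_text(strip=True)
            continue
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 3:  # skip rows that are not name/league/team
            records.append(cells + [country])
    return pd.DataFrame(
        records, columns=["player-name", "League", "team_name", "country"]
    )


df_sample = rows_to_frame(SAMPLE_HTML)
print(df_sample)
```

Collecting plain lists and building the DataFrame once at the end tends to be simpler and faster than appending to an empty DataFrame with df.loc[length] inside the loop.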
With pandas, here is a proposal using read_html with some post-processing:
cols = ["player-name", "League", "team_name"]
tmp = pd.read_html(requests.get(
    url, headers={"User-Agent": "Mozilla/5.0"}).content)[0]
df = (
    tmp.T.reset_index().T  # push the incorrect 'England' header down into the rows
    .assign(country=lambda x: x.pop(3).str.split(".").str[0].ffill())
    .iloc[1:].loc[tmp.iloc[:, -1].isna()]
    .set_axis(cols + ["country"], axis=1)
)
Output:
print(df)
      player-name          League          team_name  country
0        A. Iwobi  Premier League             Fulham  England
1      T. Awoniyi  Premier League  Nottingham Forest  England
2         O. Aina  Premier League  Nottingham Forest  England
3       F. Onyeka  Premier League          Brentford  England
4       C. Bassey  Premier League             Fulham  England
...           ...             ...                ...      ...
1078   S. Danjuma   Yemeni League      Al Ahli San'a    Yemen
1079  M. Alhassan   Yemeni League    Yarmuk al Rawda    Yemen
1080     A. Nweze   Yemeni League    Yarmuk al Rawda    Yemen
1081  A. Olalekan   Yemeni League      Al Sha'ab Ibb    Yemen
1082     A. Adisa   Yemeni League          Al Urooba    Yemen

[975 rows x 4 columns]
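The core idea in the chain above is to pull the country out of its marker rows, forward-fill it onto the player rows below, and then drop the marker rows. The toy example below shows that idea in isolation. Note that the frame raw is made up and uses a simplified layout (country name in the first column of an otherwise empty row), which is an assumption and not the exact layout read_html returns for the real page.

```python
import pandas as pd

# Hypothetical rows shaped like a raw table dump: a country marker row sits
# above the players that belong to it.
raw = pd.DataFrame(
    {
        "player-name": ["England", "A. Iwobi", "T. Awoniyi", "Yemen", "A. Adisa"],
        "League": [None, "Premier League", "Premier League", None, "Yemeni League"],
        "team_name": [None, "Fulham", "Nottingham Forest", None, "Al Urooba"],
    }
)

# A marker row is one where every column except the first is empty.
is_marker = raw[["League", "team_name"]].isna().all(axis=1)

# Keep the country name only on marker rows, then carry it downward.
country = raw["player-name"].where(is_marker).ffill()

# Attach the filled column and discard the marker rows themselves.
tidy = raw.assign(country=country).loc[~is_marker].reset_index(drop=True)
print(tidy)
```

Because the marker rows are filtered out only after the forward fill, every player row ends up carrying the country of the nearest marker above it.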