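I'm working on scraping NFL statistics, though honestly the activity itself doesn't matter. I've spent a ton of time debugging because I can't believe what it's doing: either I'm going crazy, or there is some sort of bug in a package or in Python itself. Here's the code I'm working with: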
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
import string
import numpy as np
#get player list
players = pd.DataFrame({"name":[],"url":[],"positions":[],"startYear":[],"endYear":[]})
letters = list(string.ascii_uppercase)
for letter in letters:
    print(letter)
    players_html = requests.get("https://www.pro-football-reference.com/players/"+letter+"/")
    soup = bs(players_html.content,"html.parser")
    for player in soup.find("div",{"id":"div_players"}).find_all("p"):
        temp_row = {}
        temp_row["url"] = "https://www.pro-football-reference.com"+player.find("a")["href"]
        temp_row["name"] = player.text.split("(")[0].strip()
        years = player.text.split(")")[1].strip()
        temp_row["startYear"] = int(years.split("-")[0])
        temp_row["endYear"] = int(years.split("-")[1])
        temp_row["positions"] = player.text.split("(")[1].split(")")[0]
        players = players.append(temp_row,ignore_index=True)
players = players[players.endYear > 2000]
players.reset_index(inplace=True,drop=True)
game_df = pd.DataFrame()
def apply_test(row):
    #print(row)
    url = row['url']
    #print(list(range(int(row['startYear']),int(row['endYear'])+1)))
    for yr in range(int(row['startYear']),int(row['endYear'])+1):
        print(yr)
        content = requests.get(url.split(".htm")[0]+"/gamelog/"+str(yr)).content
        soup = bs(content,'html.parser').find("div",{"id":"all_stats"})
        #overheader
        over_headers = []
        for over in soup.find("thead").find("tr").find_all("th"):
            if("colspan" in over.attrs.keys()):
                for i in range(0,int(over['colspan'])):
                    over_headers = over_headers + [over.text]
            else:
                over_headers = over_headers + [over.text]
        #headers
        headers = []
        for header in soup.find("thead").find_all("tr")[1].find_all("th"):
            headers = headers + [header.text]
        all_headers = [a+"___"+b for a,b in zip(over_headers,headers)]
        #remove first column, it's meaningless
        all_headers = all_headers[1:len(all_headers)]
        for row in soup.find("tbody").find_all("tr"):
            temp_row = {}
            for i,col in enumerate(row.find_all("td")):
                temp_row[all_headers[i]] = col.text
            game_df = game_df.append(temp_row,ignore_index=True)
players.apply(apply_test,axis=1)
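Now, I could get on with what I'm actually trying to do, but there seems to be a much higher-level issue here. For the first player, startYear and endYear in the for loop are 2013 and 2014, so the loop should set yr to 2013 and then 2014. But if you look at what print(yr) outputs, you'll see that it prints 2013 twice. If I simply comment out the game_df = game_df.append(temp_row,ignore_index=True) line, the printed values of yr are correct. An error is raised shortly after those first two lines are printed, but that is expected and one I'm comfortable debugging. What is blowing my mind right now is that appending to a global DataFrame makes a for loop behave differently. Can someone help with this?
Thanks.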
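I don't really follow what the overall objective is, but I do notice two things:
Either you need to declare global game_df inside the function, before the game_df = game_df.append(temp_row,ignore_index=True) line (otherwise the assignment makes game_df a local name), or, better still, pass it in as an argument in the def signature, though you would then need to amend the players.apply(apply_test,axis=1) call accordingly.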
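To make the scoping point concrete, here is a minimal sketch without pandas or network access. The names bump_broken, bump_global, collect and game_rows are made up for illustration and are not part of the question's code. It shows that assigning to a module-level name inside a function makes that name local, and that passing an accumulator in as an argument avoids the problem.

```python
game_count = 0  # stands in for the module-level game_df

def bump_broken():
    # Assigning to game_count anywhere in the body makes it a local name,
    # so reading it on the right-hand side fails before the assignment runs.
    game_count = game_count + 1

def bump_global():
    global game_count  # rebind the module-level name instead of a local one
    game_count = game_count + 1

def collect(row, game_rows):
    # Alternative: pass the accumulator in and mutate it in place.
    # No name is rebound, so no global declaration is needed.
    game_rows.append({"yr": row["startYear"]})

try:
    bump_broken()
    broken_error = None
except UnboundLocalError as exc:
    broken_error = exc

bump_global()

game_rows = []
for row in [{"startYear": 2013}, {"startYear": 2014}]:
    collect(row, game_rows)
```

With pandas, the extra argument could be supplied through the args parameter of DataFrame.apply, for example players.apply(apply_test, axis=1, args=(game_rows,)), and the collected dicts turned into a frame afterwards with pd.DataFrame(game_rows). As for 2013 being printed twice: older pandas releases documented that apply may call the function twice on the first row, so the repeated print may simply be the function being re-run before the UnboundLocalError surfaces. That is a guess about the asker's pandas version, not something I have verified.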
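You need to handle the cases where find returns None, e.g. soup.find("thead").find_all("tr")[1].find_all("th") for the page https://www.pro-football-reference.com/players/A/AaitIs00/gamelog/2014. Perhaps wrap these lookups in try/except blocks and supply appropriate default values.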
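As a rough sketch of that second point, the chained lookup can be wrapped so that a missing table yields an empty list instead of an exception. The helper name header_texts and the inline HTML strings are invented for this example; the real pages will differ, and returning an empty list as the default is an assumption about what is convenient downstream.

```python
from bs4 import BeautifulSoup

def header_texts(html):
    """Return the texts of the second header row, or [] if the table is missing."""
    soup = BeautifulSoup(html, "html.parser").find("div", {"id": "all_stats"})
    try:
        header_row = soup.find("thead").find_all("tr")[1]
    except (AttributeError, IndexError):
        # AttributeError: find() returned None somewhere along the chain.
        # IndexError: the thead has fewer than two rows.
        return []
    return [th.text for th in header_row.find_all("th")]

# Made-up page fragments: one with a two-row header, one with no stats table.
with_table = (
    '<div id="all_stats"><table><thead>'
    '<tr><th colspan="2">Games</th></tr>'
    '<tr><th>Date</th><th>Tm</th></tr>'
    '</thead></table></div>'
)
without_table = '<div id="content"><p>No games</p></div>'

found = header_texts(with_table)
missing = header_texts(without_table)
```

Inside apply_test, an empty result for a given year could then be used to skip that year with continue rather than letting the whole apply call fail.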