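I am working on web scraping NBA statistics and want to be able to sort by stats such as points, assists, and blocks.
I have my Pandas DataFrame, which prints the players and their stats correctly and sorts by integer columns such as age, as shown below.
However, when I try to sort by points, it does not sort from the highest value to the lowest. Instead it sorts by the highest leading digit, for example from 9.9 down to 0, even though there are clearly players averaging more than 10.0 points per game.
Are the numbers in the DataFrame actually stored as strings, so that string comparison causes this problem?
This is the code I am running: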
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

year = 2021
# URL of the page to scrape
url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html".format(year)
# Fetch the HTML from the given URL
html = urlopen(url)
soup = BeautifulSoup(html, features="html.parser")
table = soup.find_all(class_="full_table")
head = soup.find(class_="thead")
headers_raw = head.text
headers = headers_raw.replace("\n", ",").split(",")[2:-1]
players = []
for row in table:
    # td.text is always a string, so every column ends up holding text
    player = [td.text for td in row.find_all("td")]
    players.append(player)
stats = pd.DataFrame(players, columns=headers)
sorted_by_points = stats.sort_values('PTS', ascending=False)
Use stats.dtypes to check the column types.
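To make the symptom concrete, here is a minimal sketch that does not touch the website. The three-row `sample` frame and its values are made up for illustration; it only assumes, as in the question, that every scraped cell arrives as text.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table: every cell is a string,
# which is what td.text returns.
sample = pd.DataFrame(
    {"Player": ["A", "B", "C"], "PTS": ["9.9", "31.1", "10.0"]}
)

# The PTS column is reported as a text dtype (object or str), not float64.
print(sample.dtypes)

# Strings compare character by character, so "9.9" beats "31.1" and "10.0"
# because "9" > "3" > "1".
as_text = sample.sort_values("PTS", ascending=False)["PTS"].tolist()
print(as_text)

# After conversion the comparison is numeric.
as_number = pd.to_numeric(sample["PTS"]).sort_values(ascending=False).tolist()
print(as_number)
```

If the first sorted list starts with "9.9", the column is being compared as text rather than as numbers.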
Yes. Convert the PTS column to float:
stats["PTS"] = pd.to_numeric(stats["PTS"])
sorted_by_points = stats.sort_values(by="PTS", ascending=False)
print(sorted_by_points)
Prints:
Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
37 Bradley Beal SG 27 WAS 47 47 35.4 10.9 22.7 .483 2.2 6.5 .338 8.7 16.2 .541 .531 7.0 7.9 .897 1.2 3.6 4.8 4.8 1.1 0.3 3.3 2.4 31.1
112 Stephen Curry PG 32 GSW 49 49 34.0 10.2 20.8 .491 5.1 12.0 .427 5.1 8.8 .579 .614 5.5 6.0 .922 0.5 5.1 5.6 5.9 1.2 0.1 3.2 1.8 31.0
140 Joel Embiid C 26 PHI 38 38 32.2 9.4 18.2 .516 1.2 3.1 .379 8.2 15.1 .544 .548 10.1 11.8 .855 2.2 8.9 11.1 3.0 1.0 1.4 3.1 2.4 30.0
286 Damian Lillard PG 30 POR 52 52 35.9 8.8 20.1 .440 4.1 10.8 .379 4.8 9.3 .510 .541 6.9 7.5 .925 0.5 3.7 4.2 7.7 0.9 0.3 3.2 1.6 28.7
125 Luka Dončić PG 21 DAL 51 51 35.1 10.2 20.9 .486 3.0 8.3 .359 7.2 12.6 .569 .557 5.3 7.3 .727 0.8 7.1 7.9 8.7 1.0 0.7 4.3 2.3 28.6
11 Giannis Antetokounmpo PF 26 MIL 47 47 33.7 10.3 18.3 .565 1.1 3.7 .299 9.2 14.6 .632 .595 6.7 9.8 .683 1.8 9.5 11.2 6.1 1.1 1.3 3.7 2.7 28.4
276 Zach LaVine SG 25 CHI 53 53 35.2 9.8 19.4 .506 3.4 8.2 .416 6.4 11.2 .572 .594 4.4 5.2 .848 0.6 4.5 5.1 5.1 0.8 0.5 3.6 2.3 27.5
236 Kyrie Irving PG 28 BRK 41 41 35.1 10.4 20.4 .511 2.7 7.0 .389 7.7 13.5 .573 .577 3.7 4.0 .916 1.1 3.7 4.8 6.1 1.3 0.6 2.4 2.7 27.3
134 Kevin Durant PF 32 BRK 24 22 32.7 9.3 17.1 .543 2.5 5.5 .462 6.8 11.6 .581 .617 6.1 7.0 .870 0.4 6.3 6.7 5.2 0.6 1.3 3.7 2.0 27.3
511 Zion Williamson PF 20 NOP 52 52 33.1 10.3 16.8 .614 0.2 0.6 .300 10.1 16.2 .625 .619 6.0 8.6 .699 2.6 4.6 7.2 3.7 0.9 0.7 2.6 2.3 26.8
335 Donovan Mitchell PG 24 UTA 53 53 33.4 9.0 20.6 .438 3.4 8.7 .386 5.7 11.9 .476 .520 5.0 6.0 .845 0.9 3.5 4.4 5.2 1.0 0.3 2.8 2.2 26.4
253 Nikola Jokić C 25 DEN 56 56 35.2 10.2 18.1 .567 1.4 3.3 .422 8.8 14.8 .599 .605 4.2 4.9 .855 2.9 8.1 11.0 8.8 1.5 0.6 3.1 2.7 26.1
458 Jayson Tatum SF 22 BOS 51 51 35.6 9.4 20.4 .462 2.9 7.5 .388 6.5 12.9 .505 .534 4.2 4.8 .873 0.7 6.5 7.1 4.2 1.2 0.4 2.5 1.8 26.0
282 Kawhi Leonard SF 29 LAC 46 46 34.4 9.4 18.2 .516 2.0 5.0 .393 7.4 13.2 .562 .569 5.0 5.7 .878 1.1 5.5 6.7 5.1 1.7 0.4 2.0 1.7 25.7
57 Devin Booker SG 24 PHO 52 52 33.6 9.3 19.2 .485 2.0 5.8 .349 7.3 13.3 .545 .539 4.8 5.6 .862 0.4 3.8 4.1 4.5 0.9 0.3 3.2 2.8 25.5
243 LeBron James PG 36 LAL 41 41 33.9 9.5 18.4 .513 2.4 6.5 .368 7.1 12.0 .592 .578 4.1 5.8 .703 0.6 7.3 7.9 7.9 1.0 0.6 3.7 1.6 25.4
521 Trae Young PG 22 ATL 52 52 34.3 7.7 17.9 .431 2.3 6.5 .357 5.4 11.4 .472 .495 7.7 8.8 .871 0.6 3.3 3.9 9.5 0.8 0.2 4.3 1.9 25.4
156 De'Aaron Fox PG 23 SAC 56 56 35.2 9.2 19.2 .481 1.8 5.5 .323 7.4 13.6 .545 .527 5.1 7.1 .713 0.6 2.9 3.5 7.2 1.5 0.5 3.0 2.9 25.3
194 James Harden PG-SG 31 TOT 42 42 37.1 8.0 17.2 .463 2.8 7.8 .358 5.2 9.4 .549 .544 6.5 7.5 .870 0.8 7.2 8.0 10.9 1.2 0.7 4.1 2.3 25.2
475 Karl-Anthony Towns C 25 MIN 36 36 34.2 8.6 17.6 .491 2.4 6.1 .395 6.2 11.5 .542 .560 5.0 5.8 .874 2.6 8.2 10.8 4.6 0.7 1.4 3.2 3.6 24.7
71 Jaylen Brown SG 24 BOS 52 52 34.1 9.3 18.8 .493 2.7 6.8 .400 6.6 12.0 .546 .566 3.3 4.3 .754 1.2 4.6 5.8 3.4 1.2 0.6 2.7 2.9 24.6
437 Collin Sexton SG 22 CLE 47 47 35.7 8.8 18.4 .479 1.6 4.4 .367 7.2 14.0 .514 .523 4.9 6.0 .816 0.8 2.1 2.9 4.1 1.1 0.1 2.6 2.7 24.2
235 Brandon Ingram SF 23 NOP 52 52 34.7 8.5 18.2 .466 2.4 6.3 .382 6.1 12.0 .510 .532 4.8 5.4 .886 0.6 4.4 5.0 4.8 0.7 0.7 2.5 2.0 24.2
487 Nikola Vučević C 30 TOT 57 57 33.6 9.7 20.1 .485 2.6 6.3 .417 7.1 13.8 .516 .550 2.1 2.5 .836 2.0 9.3 11.3 3.8 1.0 0.7 1.7 1.9 24.1
170 Shai Gilgeous-Alexander SG 22 OKC 35 35 33.7 8.2 16.1 .508 2.0 4.9 .418 6.2 11.3 .547 .571 5.3 6.5 .808 0.5 4.2 4.7 5.9 0.8 0.7 3.0 2.0 23.7
407 Julius Randle PF 26 NYK 57 57 37.4 8.4 18.3 .461 2.1 5.2 .405 6.3 13.1 .483 .519 4.8 5.9 .802 1.3 9.2 10.5 6.1 1.0 0.3 3.5 3.2 23.7
167 Paul George SF 30 LAC 44 44 33.6 8.4 17.5 .480 3.3 7.5 .437 5.1 9.9 .513 .574 3.5 4.0 .886 0.9 5.4 6.3 5.5 1.2 0.5 3.2 2.4 23.6
315 CJ McCollum SG 29 POR 31 31 33.5 8.6 19.4 .444 3.8 9.6 .396 4.8 9.8 .492 .542 2.3 2.8 .818 0.6 3.3 3.9 4.7 1.1 0.5 1.3 1.9 23.4
114 Anthony Davis PF 27 LAL 23 23 32.8 8.9 16.7 .533 0.7 2.5 .293 8.1 14.1 .575 .555 4.0 5.7 .715 2.0 6.3 8.4 3.0 1.3 1.8 2.0 1.8 22.5
178 Jerami Grant SF 26 DET 50 50 34.3 7.4 17.3 .428 2.2 6.2 .354 5.2 11.2 .469 .491 5.4 6.3 .860 0.7 4.0 4.7 2.9 0.7 1.1 2.1 2.3 22.4
500 Russell Westbrook PG 32 WAS 49 49 35.4 8.3 18.9 .441 1.3 4.1 .312 7.1 14.8 .477 .475 4.0 6.3 .624 1.7 9.3 10.9 10.8 1.3 0.4 4.9 2.8 21.9
67 Malcolm Brogdon PG 28 IND 50 50 35.1 8.1 17.6 .459 2.7 6.7 .399 5.4 10.9 .496 .535 2.6 3.0 .859 1.1 3.9 5.0 6.0 0.9 0.2 2.0 2.0 21.4
80 Jimmy Butler SF 31 MIA 41 41 33.7 7.1 14.4 .495 0.5 2.0 .232 6.7 12.4 .537 .511 6.7 7.9 .852 1.9 5.3 7.2 7.2 2.1 0.4 2.1 1.4 21.4
346 Jamal Murray PG 23 DEN 48 48 35.5 7.9 16.5 .477 2.7 6.6 .408 5.2 9.9 .523 .559 2.8 3.2 .869 0.8 3.3 4.0 4.8 1.3 0.3 2.3 2.0 21.2
119 DeMar DeRozan PF 31 SAS 47 47 34.0 7.4 14.9 .495 0.4 1.4 .281 7.0 13.5 .517 .508 6.1 6.9 .880 0.7 3.6 4.3 7.2 0.9 0.3 1.9 2.0 21.2
517 Christian Wood C 25 HOU 34 34 31.9 8.2 15.4 .531 1.8 4.7 .385 6.4 10.7 .595 .590 2.9 4.6 .641 1.7 7.6 9.4 1.6 0.9 1.2 1.9 2.1 21.1
440 Pascal Siakam PF 26 TOR 46 46 35.6 7.5 16.5 .453 1.2 4.2 .282 6.3 12.3 .512 .489 4.7 5.5 .839 1.7 5.5 7.2 4.7 1.1 0.7 2.2 3.3 20.8
427 Terry Rozier SG 26 CHO 53 53 34.0 7.5 16.0 .468 3.4 8.3 .405 4.1 7.7 .536 .573 2.4 2.9 .824 0.6 3.7 4.2 3.8 1.3 0.4 1.9 1.8 20.7
492 John Wall PG 30 HOU 37 37 32.1 7.4 18.1 .405 2.1 6.3 .328 5.3 11.9 .446 .462 3.8 5.2 .734 0.5 2.8 3.2 6.8 1.1 0.8 3.5 1.2 20.6
201 Tobias Harris PF 28 PHI 49 49 33.3 7.9 15.2 .521 1.4 3.5 .407 6.5 11.7 .555 .568 3.2 3.6 .886 1.0 6.2 7.2 3.6 0.9 0.9 1.9 2.0 20.5
399 Kristaps Porziņģis C 25 DAL 37 37 31.4 7.7 16.4 .473 2.3 6.3 .359 5.5 10.0 .544 .542 2.8 3.2 .850 2.0 7.4 9.4 1.6 0.4 1.5 1.4 2.6 20.5
330 Khris Middleton SF 29 MIL 54 54 33.4 7.4 15.5 .479 2.2 5.1 .433 5.2 10.4 .502 .550 3.0 3.4 .886 0.8 5.2 6.1 5.6 1.1 0.2 2.8 2.4 20.1
430 Domantas Sabonis PF 24 IND 53 53 35.7 7.5 14.4 .520 0.8 2.6 .302 6.7 11.8 .568 .547 4.1 5.6 .731 2.5 9.1 11.6 6.0 1.1 0.5 3.4 3.4 19.9
371 Victor Oladipo SG 28 TOT 33 33 32.7 7.1 17.5 .408 2.4 7.2 .326 4.8 10.2 .466 .476 3.2 4.2 .754 0.4 4.5 4.8 4.6 1.4 0.4 2.5 2.5 19.8
207 Gordon Hayward SF 30 CHO 44 44 34.0 7.1 15.0 .473 1.9 4.7 .415 5.1 10.3 .499 .537 3.5 4.2 .843 0.8 5.0 5.9 4.1 1.2 0.3 2.1 1.7 19.6
38 Malik Beasley SG 24 MIN 37 36 32.8 7.1 16.2 .440 3.5 8.7 .399 3.7 7.5 .487 .547 1.8 2.2 .850 0.8 3.6 4.4 2.4 0.8 0.2 1.6 1.7 19.6
483 Fred VanVleet SG 26 TOR 45 45 36.1 6.4 16.4 .391 3.3 8.9 .366 3.1 7.4 .422 .491 3.4 3.9 .885 0.6 3.5 4.2 6.1 1.7 0.8 2.0 2.4 19.5
401 Norman Powell SG-SF 27 TOT 54 43 31.1 6.5 13.4 .486 2.6 6.2 .421 3.9 7.1 .543 .584 3.4 4.0 .860 0.6 2.5 3.1 1.8 1.2 0.3 1.8 2.4 19.1
429 D'Angelo Russell PG 24 MIN 28 19 27.7 6.7 15.6 .429 2.8 7.1 .399 3.9 8.6 .454 .519 2.8 3.4 .802 0.4 2.0 2.4 4.9 1.0 0.4 2.8 1.9 19.0
3 Bam Adebayo C 23 MIA 51 51 33.4 7.2 12.7 .566 0.0 0.2 .250 7.2 12.6 .570 .568 4.5 5.7 .803 2.3 7.0 9.3 5.2 1.0 1.1 2.8 2.3 19.0
339 Ja Morant PG 21 MEM 47 47 31.9 6.7 15.1 .444 1.0 3.6 .274 5.7 11.5 .497 .477 4.3 5.8 .739 0.9 2.7 3.6 7.3 0.8 0.2 3.1 1.4 18.7
155 Evan Fournier SF-SG 28 TOT 30 26 30.1 6.2 13.6 .457 2.8 7.0 .398 3.4 6.5 .520 .560 3.4 4.2 .795 0.2 2.6 2.7 3.4 1.1 0.4 1.9 2.2 18.6
284 Caris LeVert SG 26 TOT 32 24 30.7 7.0 16.2 .431 1.8 5.6 .315 5.2 10.6 .491 .485 2.6 3.2 .804 0.8 3.7 4.5 4.8 1.4 0.5 2.0 2.0 18.3
505 Andrew Wiggins PF 25 GSW 57 57 32.7 6.9 14.4 .476 2.0 5.2 .387 4.9 9.2 .527 .546 2.4 3.4 .697 1.0 3.7 4.7 2.3 0.9 0.9 1.8 2.2 18.2
135 Anthony Edwards SG 19 MIN 58 41 31.3 6.6 16.5 .397 2.2 6.9 .320 4.3 9.6 .453 .465 2.8 3.5 .788 0.8 3.6 4.4 2.7 1.1 0.4 2.1 1.7 18.1
101 John Collins PF 23 ATL 48 48 30.1 7.0 12.8 .545 1.3 3.3 .377 5.8 9.5 .604 .594 2.7 3.3 .840 2.1 5.6 7.6 1.4 0.5 1.0 1.2 3.3 18.0
176 Eric Gordon SG 32 HOU 27 13 29.2 5.9 13.6 .433 2.6 7.8 .329 3.3 5.8 .573 .527 3.5 4.2 .825 0.3 1.9 2.1 2.6 0.5 0.5 1.9 1.6 17.8
490 Kemba Walker PG 30 BOS 37 37 31.4 6.1 15.2 .401 2.8 8.0 .345 3.4 7.2 .464 .492 2.8 3.0 .937 0.3 3.6 3.9 5.1 1.1 0.3 2.1 1.4 17.8
...
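Since the question also mentions sorting by assists and blocks, the same conversion can probably be applied to every stat column in one pass. The sketch below uses a small made-up `sample` frame rather than the scraped table, and the `text_cols` list is an assumption about which columns should stay as text. Passing `errors="coerce"` turns cells that cannot be parsed (for example an empty percentage) into NaN instead of raising.

```python
import pandas as pd

# Hypothetical miniature of the scraped table; all values are strings.
sample = pd.DataFrame(
    {
        "Player": ["A", "B", "C"],
        "Tm": ["WAS", "GSW", "PHI"],
        "AST": ["4.8", "10.9", "3.0"],
        "3P%": [".338", "", ".379"],
        "PTS": ["31.1", "9.9", "30.0"],
    }
)

# Assumption: these columns are labels and should stay as text.
text_cols = ["Player", "Tm"]
stat_cols = [col for col in sample.columns if col not in text_cols]

# Convert each stat column; unparseable cells become NaN.
sample[stat_cols] = sample[stat_cols].apply(pd.to_numeric, errors="coerce")

by_assists = sample.sort_values("AST", ascending=False)
print(by_assists)
```

With the real table, `Pos` would likely belong in `text_cols` as well.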
def comparison_summarize(compare_df):
    orig_col=list(compare_df.columns.levels[0])
    numeric_cols=[]
    results_dict_cat={}
    results_dict_num={}
    for ele in compare_df._get_numeric_data().columns:
        numeric_cols.append(ele[0])
    categorical_cols=list(set(orig_col)-set(numeric_cols))
    numeric_cols=list(set(numeric_cols))
    compare_df.columns = ['@'.join(col) for col in compare_df.columns.values]
    for col in categorical_cols:
        nomatch_cat=compare_df[compare_df[col+'@self']!=compare_df[col+'@other']]
        if nomatch_cat.shape[0]==0:
            results_dict_cat[col]={}
            results_dict_cat[col]['Difference']="no difference in 2 dataframes"
        else:
            results_dict_cat[col]={}
            results_dict_cat[col]['Difference']="there is difference present"
            results_dict_cat[col]['difference_summary']=nomatch_cat[[col+'@self',col+'@other']].groupby([col+'@self',col+'@other']).size().reset_index().rename(columns={0:'count'})
    tempdf=compare_df.copy()
    for col in numeric_cols:
        tempdf[col+'diff']=tempdf[col+'@self']-tempdf[col+'@other']
        tempdf[col+'diff']=tempdf[col+'diff'].abs()
        nomatch_num=tempdf[tempdf[col+'diff']!=0]
        if nomatch_num.shape[0]==0:
            results_dict_num[col]={}
            results_dict_num[col]['Difference']="no difference in 2 dataframes"
        else:
            results_dict_num[col]={}
            results_dict_num[col]['Difference']="there is difference present"
            results_dict_num[col]['difference_summary']=nomatch_num[[col+'@self',col+'@other',col+'diff']].describe()
    return compare_df,results_dict_cat,results_dict_num
compare_df,results_dict_cat,results_dict_num=comparison_summarize(df)
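The snippet does not show where `df` comes from. The `@self` and `@other` suffixes match the two-level columns that pandas `DataFrame.compare` produces, so one plausible way to build the input is sketched below. This is an assumption, not something the snippet states, and the `left` and `right` frames are made up for illustration.

```python
import pandas as pd

# Hypothetical frames to compare; same shape, same labels.
left = pd.DataFrame({"team": ["BOS", "LAL"], "pts": [24.6, 25.4]})
right = pd.DataFrame({"team": ["BOS", "LAC"], "pts": [24.6, 25.7]})

# keep_shape=True keeps every row and column; keep_equal=True keeps equal
# values instead of replacing them with NaN.
compared = left.compare(right, keep_shape=True, keep_equal=True)

# The columns form a MultiIndex such as ("team", "self"), ("team", "other").
# Joining each pair with "@" gives the flat names the function looks up.
flat = compared.copy()
flat.columns = ["@".join(col) for col in flat.columns.values]
print(list(flat.columns))

# Rows where a categorical column differs between the two frames.
team_mismatch = flat[flat["team@self"] != flat["team@other"]]
print(team_mismatch)
```

A frame built like `compared` has the shape that `comparison_summarize` appears to expect as its argument.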