Scraping a website that has multiple tables but no distinguishing classes

Question  Votes: 2  Answers: 2

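I need to scrape the bottom table, the one labeled 'Fielding'. I can't get past the first table on the page. The site has some odd HTML that doesn't seem easy to scrape.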

Link here

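I have tried selecting by the class 'stats-fullbox clearfix', but that only gives me the first table. If I use 'stats-wrapper clearfix', it gives me the whole page. I only need the bottom Fielding table. I will be doing this for every D1 JUCO team page, so I don't want to go in and edit the data for each team by hand.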

import requests
from bs4 import BeautifulSoup

url = 'http://njcaa.org/sports/bsb/2018-19/div1/teams/arizonawesterncollege?view=lineup'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
thing = response.content
soup = BeautifulSoup(thing, 'html.parser')
# find() returns only the first <table> on the page, not the Fielding table
table = soup.find('table')

# Collect the text of every header and data cell, row by row
list_of_rows = []
for row in table.find_all('tr'):
    list_of_cells = []
    for cell in row.find_all(["th", "td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

# Print each row as a comma-separated line
for item in list_of_rows:
    print(','.join(item))

I only need the last table written out to a CSV file (I have already done that with similar sites and data, so I don't need help with that part).

python web-scraping beautifulsoup
2 Answers
0
votes

You can use pandas and get a nice, tidy DataFrame as output:

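import pandas as pd

# read_html returns a list of DataFrames, one per <table>; this answer takes index 8 as the Fielding table
table = pd.read_html('http://njcaa.org/sports/bsb/2018-19/div1/teams/arizonawesterncollege?view=lineup')[8]
# Promote the first row to column labels, then drop it from the data and blank out NaN
table.columns = table.iloc[0]
table = table.iloc[1:].fillna('')
print(table)
# Write to CSV without the index column
table.to_csv(r"C:\Users\User\Desktop\Data.csv", sep=',', encoding='utf-8-sig', index=False)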

Sample output (not all columns shown):

[screenshot of the printed DataFrame]
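
The header-promotion step in the answer above can be tried without touching the network. The following is only a minimal sketch: `promote_first_row` is a hypothetical helper name, and the small `raw` frame is a hand-built stand-in for what `read_html` might return when the header sits in row 0 (the `None` cell is added deliberately to show what `fillna('')` does).

```python
import pandas as pd

def promote_first_row(df: pd.DataFrame) -> pd.DataFrame:
    """Use the first row as column labels, drop it from the body, blank out NaN."""
    out = df.copy()
    out.columns = out.iloc[0]                  # first row becomes the header
    out = out.iloc[1:].reset_index(drop=True)  # remaining rows are the data
    out.columns.name = None                    # drop the leftover index label
    return out.fillna('')

# Hand-built stand-in for a scraped table (assumption, not real scraped output)
raw = pd.DataFrame([
    ['No.', 'Name', 'fpct'],
    ['15', 'Sam Hackl', '.990'],
    ['27', 'Cristian Castellanos', None],      # missing value, for illustration only
])

fielding = promote_first_row(raw)
print(fielding)
```

Resetting the index is optional; the answer's version keeps the original row numbers, which start at 1 after the header row is dropped.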


0
votes

You can use find_all and take the last table:

import requests
from bs4 import BeautifulSoup

url = 'http://njcaa.org/sports/bsb/2018-19/div1/teams/arizonawesterncollege?view=lineup'
d = BeautifulSoup(requests.get(url).text, 'html.parser')
t = d.find_all('table')[-1]  # the Fielding table is the last <table> on the page
# Header cells are <th>; str.strip() replaces the original regex whitespace trimming
header = [i.text.strip() for i in t.find_all('th')]
# Skip the first <tr> (the header row, which has no <td>) and collect each remaining row's cells
data = [[c.text.strip() for c in b.find_all('td')] for b in t.find_all('tr')[1:]]
print(header)
print(data)

Output:

['No.', 'Name', 'Yr', 'Pos', 'g', 'tc', 'po', 'a', 'e', 'fpct', 'dp', 'sba', 'rcs', 'rcs%', 'pb', 'ci']
[['15', 'Sam  Hackl', 'So', 'C', '50', '289', '269', '17', '3', '.990', '-', '23', '6', '.207', '14', '-'], ['27', 'Cristian  Castellanos', 'Fr', '1B', '37', '183', '171', '10', '2', '.989', '16', '-', '-', '-', '-', '-'], ['24', 'Jarrod  Belbin', 'Fr', 'UTL', '56', '226', '168', '44', '14', '.938', '10', '-', '-', '-', '-', '-'], ['14', 'Oliver  Frias', 'Fr', 'C', '32', '167', '150', '15', '2', '.988', '1', '5', '4', '.444', '4', '-'], ['9', 'Zach  Huffins', 'So', 'OF', '50', '129', '122', '4', '3', '.977', '-', '-', '-', '-', '-', '-'], ['17', 'Justin  Greene', 'So', 'UTL', '56', '124', '95', '27', '2', '.984', '6', '-', '-', '-', '-', '-'], ['2', 'Carlos  Arellano', 'So', 'SS', '53', '205', '91', '101', '13', '.937', '15', '-', '-', '-', '-', '-'], ['3', 'Tyson  Zamora', 'Fr', 'OF/LHP', '52', '53', '52', '0', '1', '.981', '-', '-', '-', '-', '-', '-'], ['6', 'Robbie  Wilkes', 'So', 'INF', '54', '149', '45', '91', '13', '.913', '13', '-', '-', '-', '-', '-'], ['12', 'Arjun  Huerta', 'Fr', 'UTL', '40', '33', '30', '3', '0', '1.000', '1', '-', '-', '-', '-', '-'], ['7', 'Anthony  Guzman', 'Fr', 'OF', '40', '28', '27', '0', '1', '.964', '-', '-', '-', '-', '-', '-'], ['1', 'Ramon  Miranda', 'So', 'RHP/INF', '54', '76', '18', '52', '6', '.921', '-', '-', '-', '-', '-', '-'], ['4', 'Water  Batista', 'Fr', 'C', '11', '17', '17', '0', '0', '1.000', '-', '3', '-', '-', '-', '-'], ['20', 'Ian  Drays', 'Fr', '1B/C', '6', '18', '17', '1', '0', '1.000', '-', '2', '1', '.333', '1', '-'], ['8', 'Robert  Gonzalez', 'So', 'LHP', '20', '24', '7', '16', '1', '.958', '-', '-', '-', '-', '-', '-'], ['16', 'Gabriel  Ponce', 'So', 'RHP', '16', '16', '4', '11', '1', '.938', '2', '-', '-', '-', '-', '-'], ['11', 'Arturo  Gil', 'Fr', 'RHP', '9', '5', '3', '2', '0', '1.000', '-', '-', '-', '-', '-', '-'], ['13', 'Chris  Koeiman', 'So', 'RHP', '16', '19', '3', '16', '0', '1.000', '-', '-', '-', '-', '-', '-'], ['22', 'Mason  Myhre', 'So', 'LHP', '8', '5', '3', '2', '0', '1.000', '-', '-', 
'-', '-', '-', '-'], ['10', 'Damien  Hernandez', 'Fr', 'LHP', '8', '2', '2', '0', '0', '1.000', '-', '-', '-', '-', '-', '-'], ['25', 'Austin  Cole', 'So', 'RHP', '5', '2', '1', '1', '0', '1.000', '-', '-', '-', '-', '-', '-'], ['28', 'Angelo  Wicklert', 'Fr', 'RHP', '12', '4', '1', '2', '1', '.750', '-', '-', '-', '-', '-', '-'], ['31', 'Zach  Hansen', 'So', 'RHP', '16', '5', '0', '5', '0', '1.000', '-', '-', '-', '-', '-', '-'], ['23', 'Kevin  Ortega', 'Fr', 'RHP', '6', '3', '0', '2', '1', '.667', '-', '-', '-', '-', '-', '-'], ['', 'Totals', '', '', '56', '1782', '1296', '422', '64', '.964', '13', '42', '0', '.000', '19', '-'], ['', 'Opponent', '', '', '56', '1516', '1075', '382', '59', '.961', '25', '113', '5', '.042', '11', '-']]
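
Since the question mentions running this across many team pages, relying on a fixed position (last table, or index 8) may break if a page has a different number of tables. One possible alternative, sketched here under assumptions, is to pick the table by its header cells instead. `find_table_with_columns` is a hypothetical helper, and the inline HTML is a made-up stand-in for a team page; the column names `po` and `fpct` are taken from the header shown in the output above.

```python
from bs4 import BeautifulSoup

def find_table_with_columns(soup, required):
    """Return the first <table> whose <th> texts include every name in `required`, else None."""
    wanted = {name.lower() for name in required}
    for table in soup.find_all('table'):
        headers = {th.get_text(strip=True).lower() for th in table.find_all('th')}
        if wanted <= headers:  # every required column is present
            return table
    return None

# Made-up page with three tables; only the second has fielding columns
html = """
<table><tr><th>Name</th><th>avg</th></tr><tr><td>A</td><td>.300</td></tr></table>
<table><tr><th>Name</th><th>po</th><th>fpct</th></tr><tr><td>B</td><td>12</td><td>.990</td></tr></table>
<table><tr><th>Name</th><th>era</th></tr><tr><td>C</td><td>3.10</td></tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')
fielding = find_table_with_columns(soup, ['po', 'fpct'])
print([th.get_text(strip=True) for th in fielding.find_all('th')])
```

One caveat: if the real pages nest tables inside a wrapper table, the wrapper would match first, so the helper might need to be restricted to innermost tables.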