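I want to use BeautifulSoup to scrape HTML and pull only two columns out of each row of a table. Each "tr" row has 10 "td" cells, and I only want the "td" cells at indexes [1] and [8] from each row. What is the most Pythonic way to do this?
As the input below shows, I have one table with one tbody and three rows, each with 10 cells.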
<table id="tblMain">
<tbody>
<tr>
<td>text</td>
<td>data1</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>data2</td>
<td>text</td>
</tr>
<tr>
<td>text</td>
<td>data1</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>data2</td>
<td>text</td>
</tr>
<tr>
<td>text</td>
<td>data1</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>data2</td>
<td>text</td>
</tr>
</tbody>
</table>
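I understand how to loop and use the cell index to get the "td" at [1] and [8]. However, I get confused when I try to write that data back to the csv one row at a time.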
table = soup.find('table', {'id':'tblMain'} )
table_body = table.find('tbody')
rows = table_body.findAll('tr')
data1_columns = []
data2_columns = []
for row in rows[1:]:
    data1 = row.findAll('td')[1]
    data1_columns.append(data1.text)
    data2 = row.findAll('td')[8]
    data2_columns.append(data2.text)
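This is my current code. It finds the table, the rows, and all the "td" cells and writes them to the .csv correctly. However, instead of writing all 10 "td" cells of each row to a csv line, I only want to grab "td"[1] and "td"[8].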
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id':'tblMain'} )
table_body = table.find('tbody')
rows = table_body.findAll('tr')
filename = '%s.csv' % reportname
with open(filename, "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        csv_row = []
        for cell in row.findAll("td"):
            csv_row.append(cell.get_text())
        writer.writerow(csv_row)
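I would like to put only "td"[1] and "td"[8] into my csv_row, so that each such list is written to the csv with writer.writerow.
Expected rows, written to csv_row and then to my csv file: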
['data1', 'data2']
['data1', 'data2']
['data1', 'data2']
You almost had it:
for row in rows:
    cells = row.findAll("td")
    csv_row = [cells[1].get_text(), cells[8].get_text()]
    writer.writerow(csv_row)
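If the set of wanted columns might change later, the index pick can be factored out of the loop. This is only a minimal sketch, assuming each row has already been reduced to a list of cell strings. The names WANTED and pick_columns are hypothetical, the rows list is a stand-in for the scraped cells, and the CSV goes to an in-memory buffer just to keep the example self-contained.

```python
import csv
import io
from operator import itemgetter

# Zero-based positions of the cells to keep (hypothetical name).
WANTED = (1, 8)
pick_columns = itemgetter(*WANTED)

# Stand-in for the text of the ten <td> cells in each of the three rows.
rows = [["text", "data1"] + ["text"] * 6 + ["data2", "text"] for _ in range(3)]

buffer = io.StringIO()
writer = csv.writer(buffer)
for cells in rows:
    # itemgetter returns a tuple of the selected items, in the order requested.
    writer.writerow(pick_columns(cells))

csv_text = buffer.getvalue()
print(csv_text)
```

Note that itemgetter called with a single index returns the item itself rather than a tuple, so this helper fits best when two or more columns are wanted.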
Full code:
html ='''<table id ="tblMain">
<tbody>
<tr>
<td>text</td>
<td>data1</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>data2</td>
<td>text</td>
<tr>
<td>text</td>
<td>data1</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>data2</td>
<td>text</td>
<tr>
<td>text</td>
<td>data1</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>data2</td>
<td>text</td>
'''
from bs4 import BeautifulSoup
import csv

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id':'tblMain'} )
table_body = table.find('tbody')
rows = table_body.findAll('tr')
reportname = 'output'
filename = '%s.csv' % reportname
with open(filename, "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        cells = row.findAll("td")
        # keep only the cells at indexes 1 and 8
        csv_row = [cells[1].get_text(), cells[8].get_text()]
        writer.writerow(csv_row)
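One caveat with plain indexing: asking for index 8 raises IndexError on any row that has fewer than nine "td" cells, for example a header row built from "th" cells. Below is a hedged sketch of one way to skip such rows. The sample HTML with a "th" header row and the function name extract_pairs are assumptions for illustration and are not taken from the original page. It uses find_all, the documented name for which findAll is an older alias.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one header row of <th> cells, then two data rows.
cells = "<td>text</td><td>data1</td>" + "<td>text</td>" * 6 + "<td>data2</td><td>text</td>"
data_row = "<tr>" + cells + "</tr>"
header_row = "<tr>" + "<th>h</th>" * 10 + "</tr>"
html = '<table id="tblMain"><tbody>' + header_row + data_row * 2 + "</tbody></table>"


def extract_pairs(html, first=1, second=8):
    """Return [cell[first], cell[second]] for every row with enough <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "tblMain"})
    pairs = []
    for row in table.find_all("tr"):
        tds = row.find_all("td")
        if len(tds) <= max(first, second):
            continue  # header or short row: nothing to pick
        pairs.append([tds[first].get_text(strip=True), tds[second].get_text(strip=True)])
    return pairs


pairs = extract_pairs(html)
print(pairs)
```

Each inner list can then be passed to writer.writerow exactly as in the loop above.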
You should be able to use the nth-of-type pseudo-class CSS selector:
from bs4 import BeautifulSoup as bs
import pandas as pd

html = 'actualHTML'  # placeholder for the real page source
soup = bs(html, 'lxml')
results = []
for row in soup.select('#tblMain tr'):
    # nth-of-type counts from 1, so (2) and (9) are the cells at indexes 1 and 8
    out_row = [item.text.strip() for item in row.select('td:nth-of-type(2), td:nth-of-type(9)')]
    results.append(out_row)
df = pd.DataFrame(results)
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig', index=False)
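The off-by-one between list indexes and the selector is easy to get wrong: nth-of-type counts from 1, so list positions 1 and 8 become nth-of-type(2) and nth-of-type(9). A small sketch of that mapping on a single row follows. The helper nth_selector and the inline sample row are assumptions made up for this illustration.

```python
from bs4 import BeautifulSoup


def nth_selector(*zero_based):
    """Build a selector group for <td> cells from zero-based positions."""
    return ", ".join(f"td:nth-of-type({i + 1})" for i in zero_based)


# Hypothetical one-row table with the same ten-cell layout as the question.
row_html = (
    "<table><tr><td>text</td><td>data1</td>"
    + "<td>text</td>" * 6
    + "<td>data2</td><td>text</td></tr></table>"
)
soup = BeautifulSoup(row_html, "html.parser")

selector = nth_selector(1, 8)
picked = [td.get_text(strip=True) for td in soup.select_one("tr").select(selector)]
print(selector)
print(picked)
```

The matches come back in document order, so listing the positions in a different order in the selector does not reorder the output columns.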
Whenever I need to pull a table and it has <table> tags, I let pandas do the work for me and then just manipulate the dataframe it returns, if needed. That is what I would do here:
html = '''<table id ="tblMain">
<tbody>
<tr>
<td> text</td>
<td> data1</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> data2</td>
<td> text</td>
<tr>
<td> text</td>
<td> data1</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> data2</td>
<td> text</td>
<tr>
<td> text</td>
<td> data1</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> data2</td>
<td> text</td>'''
from io import StringIO
import pandas as pd

# .read_html() returns a list of dataframes, one per <table>;
# the string is wrapped in StringIO because recent pandas versions
# deprecate passing literal HTML directly
tables = pd.read_html(StringIO(html))
# we want the dataframe from that list in position [0]
df = tables[0]
# Use .iloc to say I want all the rows, and columns 1, 8
df = df.iloc[:,[1,8]]
# Write the dataframe to file
df.to_csv('path.filename.csv', index=False)
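One thing to watch with this approach: by default to_csv also writes a header line with the column labels, which for a table without header cells would be the integers 1 and 8. A minimal sketch of suppressing that line so only the data rows are written is shown here. The frame is built from an in-memory list instead of read_html to keep the example self-contained, and that list is an assumption standing in for the parsed table.

```python
import pandas as pd

# Stand-in for the frame that read_html would return for the sample table.
rows = [["text", "data1"] + ["text"] * 6 + ["data2", "text"] for _ in range(3)]
df = pd.DataFrame(rows)

# Keep columns 1 and 8 by position.
subset = df.iloc[:, [1, 8]]

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = subset.to_csv(index=False, header=False)
print(csv_text)
```

Passing a file path as the first argument writes the same content to disk.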