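This website, https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html, appears to contain a poorly organized HTML table. The only thing that identifies a table cell is the width attribute of the td tags inside each tr tag. I want to scrape the information from all 60 pages. How can I scrape each table row properly? I know the header has 10 columns, but some tr tags contain 5 td tags while others contain more or fewer, so it is not easy to assign the scraped data to the correct columns.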
Here is part of my code. It extracts the data for a single row only, and it does not keep null values for the empty cells.
from bs4 import BeautifulSoup
import pandas as pd

# `page` is the requests response for the URL above
soup = BeautifulSoup(page.content, 'lxml')  # Parse the HTML
table = soup.find_all('table')[0]  # Grab the first table
new_table = pd.DataFrame(columns=range(0, 10), index=[0])  # I know the size
row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        # Cells are written left to right, so empty columns are not preserved
        new_table.iat[row_marker, column_marker] = column.get_text()
        column_marker += 1
This is the output I get from this code (all values are packed into one row with no gaps between them):
0 1 2 3 4 5 6 7 8 9
0 62.00 PACL Palaeocene Claystones SWAP NaN NaN NaN NaN NaN
But the output should actually look like this:
0 1 2 3 4 5 6 7 8 9
0 62.00 NaN NaN PACL Palaeocene Claystones NaN NaN NaN NaN SWAP
I used the approach I mentioned in the comments (using the width) to identify the null values in the data. Here is the Python code:
import requests
import bs4

URL = 'https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html'
response = requests.get(URL)
soup = bs4.BeautifulSoup(response.text, 'lxml')
tables = soup.find_all('table')
count = 0
cells_count = 0
for table in tables:
    count += 1
    if count > 2:  # skip the first two tables
        row = table.tr
        cells = row.find_all('td')
        print('')
        x = 0  # index of the column currently being filled
        width_diff = 0  # extra width carried over from the previous cell
        cell_text = []
        for cell in cells:
            width = cell.get('width')
            if int(width) < 10:  # ignore narrow spacer cells
                continue
            if width_diff > 0:
                # The previous cell was wider than its column, so it also
                # covered one (or two) empty columns
                cell_text.append('NaN ')
                if width_diff > 50:
                    x += 2
                    cell_text.append('Nan ')
                else:
                    x += 1
                width_diff = 0
            # Expected width range for column x
            if x == 0 or x == 1 or x == 2 or x == 3 or x == 4 or x == 6:
                width_range = [35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
            elif x == 5:
                width_range = [220,221,222,223,224,225,226,227,228,229,230]
            elif x == 7:
                width_range = [136]
            if cell.text:
                cell_text.append(cell.text.strip() + ' ')
            else:
                cell_text.append('NaN ')
            if int(width) not in width_range:
                width_diff = int(width) - width_range[-1]
            x += 1
            #print(x, end=' ')
        length = len(cell_text)
        for i in range(0, length):
            print(cell_text[i], end=' ')
        # Pad the row out to 9 columns
        diff = 9 - length
        if diff > 0:
            for j in range(0, diff):
                print('NaN ', end=' ')
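As you can see, I noticed that each column uses a certain range of widths. By comparing each cell's width with the width expected for its column, we can work out how much space the cell occupies. If the difference is too large, the cell also takes up the space of the next two cells.
The script may need some refinement, and you should test it against all of the URLs to make sure the data is completely clean.
Here is sample output from running this code: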
61.00 SED TERT WBDS NaN Woolwich Beds GP NaN WLDB
62.00 NaN NaN PACL NaN Palaeocene Claystones NaN Nan SWAP
63.00 NaN NaN SMFC NaN Shallow Marine Facies NaN Nan SONS
64.00 NaN NaN DMFC NaN Deep Marine Facies NaN NaN NaN
65.00 NaN NaN SLSY NaN Selsey Member GN NaN WSXB
66.00 NaN NaN MFM NaN Marsh Farm Member NaN NaN NaN
67.00 NaN NaN ERNM NaN Earnley Member NaN NaN NaN
68.00 NaN NaN WITT NaN Wittering Member NaN NaN NaN
69.00 NaN NaN WHI NaN Whitecliff Beds GZ NaN NaN
70.00 NaN NaN Nan WFSM NaN Whitecliff Sand Member NaN Nan GN
71.00 NaN WESQ NaN Nan Westray Group Equivalent NL GW WESH
72.00 NaN WESR NaN Nan Westray Group NM GO CNSB
73.00 NaN NaN THEF NaN Thet Formation NaN Nan MOFI
74.00 NaN NaN SKAD NaN Skade Formation NB NaN NONS
75.00 NaN NORD NaN Nan Nordland NP Q CNSB
75.50 NaN NaN SWCH NaN Swatchway Formation Q NaN MOFI
75.60 NaN NaN CLPT NaN Coal Pit Formation NaN NaN NaN
75.70 NaN NaN LNGB NaN Ling Bank Formation NaN NaN NaN
76.00 NaN NaN SHKL NaN Shackleton Formation GO QP ROCK
77.00 NaN NaN UGNS NaN Upper Tertiary sands NaN NM NONS
78.00 NaN NaN CLSD NaN Claret Sand NP NaN SVIG
79.00 NaN NaN BLUE NaN Blue Sand NaN NaN NaN
80.00 NaN NaN ABGF NaN Aberdeen Ground Formation QH NaN CNSB
81.00 NaN NaN NUGU NaN Upper Glauconitic Unit NB NA MOFI
82.00 NaN NaN POWD NaN Powder Sand GN NaN SVIG
83.00 NaN NaN BASD NaN Basin Sand NaN Nan CNSB
84.00 NaN NaN CRND NaN Crenulate Sand NaN NaN NaN
85.00 NaN NaN NORS NaN Nordland Sand QP NaN SONS
86.00 NaN NaN MIOS NaN Miocene Sand NM NaN ESHB
87.00 NaN NaN MIOL NaN Miocene Limestone NaN Nan CNSB
88.00 NaN NaN FLSF NaN Fladen Sand Formation GP GO WYGG
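Note: I don't know how the 0 in the first cell of your example was created, so I left it out of the answer. I'm not sure whether it is supposed to be scraped, because I couldn't find it anywhere on the page.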
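The width comparison described in this answer can also be expressed with running pixel offsets instead of per-cell width differences. This is a different technique from the script above, shown only as a minimal sketch: the nominal column widths in `COLUMN_WIDTHS`, the helper names `column_starts` and `place_cells`, and the sample cell widths are all assumptions for illustration, not values measured from the site.

```python
# Hypothetical nominal column widths in pixels (an assumption, loosely based on
# the width ranges used in the script above; not measured from the real page).
COLUMN_WIDTHS = [45, 45, 45, 45, 45, 225, 45, 45, 136]


def column_starts(widths):
    """Return the left pixel offset at which each column begins."""
    starts, offset = [], 0
    for w in widths:
        starts.append(offset)
        offset += w
    return starts


def place_cells(cells, widths=COLUMN_WIDTHS):
    """Place (width, text) cells into fixed columns by horizontal position.

    `cells` is a list of (width, text) pairs in document order. Empty cells
    only advance the offset. Each non-empty cell goes into the column whose
    start is closest to the cell's own left offset.
    """
    starts = column_starts(widths)
    row = [None] * len(widths)
    offset = 0
    for width, text in cells:
        text = text.strip()
        if text:
            col = min(range(len(starts)), key=lambda i: abs(starts[i] - offset))
            row[col] = text
        offset += width
    return row


# Illustrative cells for the "62.00" row; the widths are made up.
sample = [(45, '62.00'), (90, ''), (45, 'PACL'), (45, ''),
          (225, 'Palaeocene Claystones'), (90, ''), (136, 'SWAP')]
print(place_cells(sample))
```

Choosing the nearest column start makes the placement tolerant of widths that vary by a few pixels, but it would still need checking against every page.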
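@samy Thank you very much for this approach to scraping the website.
It needed a few small changes to work on all of the pages, so I applied them to your code to make sure everything gets scraped: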
import requests
import bs4

URL = 'https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html'
dfcolname=['OrderNo', 'Type', 'Group', 'Formation', 'Member', 'Description', 'Upper Age', 'Lower Age', 'Basin']
response = requests.get(URL)
soup = bs4.BeautifulSoup(response.text, 'lxml')
tables = soup.find_all('table')
count = 0
cells_count = 0
rlist = []  # flat list collecting every scraped cell across all rows
for table in tables:
    count += 1
    cell_text = []
    # Skip the first two tables and the last one
    if count > 2 and table != tables[-1]:
        row = table.tr
        cells = row.find_all('td')
        print('')
        x = 0  # index of the column currently being filled
        width_diff = 0  # extra width carried over from the previous cell
        cell_text = []
        for cell in cells:
            width = cell.get('width')
            if int(width) < 10:  # ignore narrow spacer cells
                continue
            if width_diff > 2:
                cell_text.append('NaN ')
                if width_diff > 50:
                    x += 2
                    cell_text.append('Nan ')
                else:
                    x += 1
                width_diff = 0
            # Expected width range for column x
            if x == 0 or x == 1 or x == 2 or x == 3 or x == 4 or x == 6:
                width_range = [35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
            elif x == 5:
                width_range = [220,221,222,223,224,225,226,227,228,229,230]
            elif x == 7:
                width_range = [136]
            if cell.text:
                cell_text.append(cell.text.strip() + ' ')
            else:
                cell_text.append('NaN ')
            if int(width) not in width_range:
                width_diff = int(width) - width_range[-1]
            x += 1
        length = len(cell_text)
        for i in range(0, length):
            rlist.append(cell_text[i])
        # Pad short rows with NaN
        diff = 8 - length
        if diff > 0:
            for j in range(0, diff):
                cell_text.append('NaN ')
        print(cell_text)
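Actually, the 0 is the index of the DataFrame. I first tried to save the results in a DataFrame, which is why there was a 0 in my earlier output.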
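To cover all 60 pages and attach the column names from `dfcolname`, something along these lines might work. It is only a sketch: the URL pattern (pages numbered 1 to 60 in place of the `2` in `Page2`) is an assumption that has not been checked against the site, and `page_urls` and `to_record` are hypothetical helper names.

```python
# Assumption: the other pages follow the same naming scheme as Page2,
# with the page number running from 1 to 60.
BASE = ('https://itportal.ogauthority.co.uk/information/well_data/'
        'lithostratigraphy_hierarchy/rptLithoStrat_1Page{}.html')

COLUMNS = ['OrderNo', 'Type', 'Group', 'Formation', 'Member',
           'Description', 'Upper Age', 'Lower Age', 'Basin']


def page_urls(n_pages=60):
    """Build the list of page URLs under the numbering assumption above."""
    return [BASE.format(n) for n in range(1, n_pages + 1)]


def to_record(cells, columns=COLUMNS):
    """Turn one scraped row (a list of strings) into a dict keyed by column.

    'NaN ' / 'Nan ' placeholders become None, short rows are padded with
    None, and any cells beyond the last column are dropped by zip().
    """
    cleaned = [None if c.strip().lower() == 'nan' else c.strip() for c in cells]
    cleaned += [None] * (len(columns) - len(cleaned))
    return dict(zip(columns, cleaned))


# One row taken from the sample output shown earlier.
row = ['61.00 ', 'SED ', 'TERT ', 'WBDS ', 'NaN ',
       'Woolwich Beds ', 'GP ', 'NaN ', 'WLDB ']
print(to_record(row))
```

A list of such records can be passed directly to `pd.DataFrame(records)`, which then adds its own integer index starting at 0.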