How can I scrape a poorly structured HTML table with BeautifulSoup in Python?

Question (votes: 2, answers: 2)

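This website, https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html, appears to have a poorly organized HTML table. The only thing that identifies a table cell is the width attribute inside each tr tag. I want to scrape the information from all 60 pages. How can I scrape every row of the table properly? I know the header has 10 columns, but some tr tags contain 5 td tags while others contain more or fewer, so it is not easy to assign the data to the correct columns.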

Here is part of my code. It extracts the data for only one row and does not keep null values for the empty cells.

import pandas as pd
from bs4 import BeautifulSoup

# `page` is the response already downloaded for the URL above
soup = BeautifulSoup(page.content, 'lxml')  # Parse the downloaded HTML
table = soup.find_all('table')[0]  # Grab the first table
new_table = pd.DataFrame(columns=range(0, 10), index=[0])  # I know the size
row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        new_table.iat[row_marker, column_marker] = column.get_text()
        column_marker += 1

This is the output I get from this code (all the values end up next to each other, with no gaps between them):

     0     1     2                  3     4   5    6    7    8    9  

0  62.00    PACL  Palaeocene Claystones  SWAP  NaN  NaN  NaN  NaN  NaN

But the actual output should look like this:

   0        1    2   3                        4   5    6    7    8    9  

0  62.00   NaN NaN  PACL  Palaeocene Claystones  NaN  NaN  NaN  NaN  SWAP
python html web-scraping html-table beautifulsoup
2 Answers
3
votes

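I used the approach I mentioned in the comments (using the cell widths) to work out where the null values in the data are. Here is the Python code: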

import requests
import bs4

URL = 'https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html'

response = requests.get(URL)
soup = bs4.BeautifulSoup(response.text, 'lxml')

tables = soup.find_all('table')
count = 0

for table in tables:
    count += 1
    if count > 2:  # skip the first two tables
        row = table.tr  # only the first row of each table is read
        cells = row.find_all('td')
        print()  # start a new output line
        x = 0  # index of the column currently being filled
        width_diff = 0  # extra width left over from the previous cell
        cell_text = []
        for cell in cells:
            width = cell.get('width')
            if int(width) < 10:  # ignore very narrow cells
                continue

            # The previous cell was wider than its column, so it covered
            # one or two empty columns: emit a placeholder for each.
            if width_diff > 0:
                cell_text.append('NaN ')
                if width_diff > 50:
                    x += 2
                    cell_text.append('Nan ')
                else:
                    x += 1
                width_diff = 0

            # Expected width range for the current column
            if x in (0, 1, 2, 3, 4, 6):
                width_range = list(range(35, 51))
            elif x == 5:
                width_range = list(range(220, 231))
            elif x == 7:
                width_range = [136]

            if cell.text:
                cell_text.append(cell.text.strip() + ' ')
            else:
                cell_text.append('NaN ')

            if int(width) not in width_range:
                width_diff = int(width) - width_range[-1]
            x += 1

        for text in cell_text:
            print(text, end=' ')
        # Pad short rows out to nine columns
        for _ in range(9 - len(cell_text)):
            print('NaN ', end=' ')

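As you can see, I noticed that each column uses a certain range of widths. By comparing each cell's width with the width expected for its column, we can tell how much space it takes up. A cell that is wider than expected covers the next column as well, and if the difference is large (more than 50 in the code), it covers the next two columns.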
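The width comparison can also be illustrated in isolation. The following is only a minimal sketch, not the script from this answer: the function name `align_cells`, the `tolerance` parameter, and the column widths in the usage example are all hypothetical, and real widths would have to be read off the page. Cells are passed in as plain `(width, text)` pairs so the idea can be tried without downloading anything.

```python
from typing import List, Optional, Sequence, Tuple


def align_cells(
    cells: Sequence[Tuple[int, str]],
    column_widths: Sequence[int],
    tolerance: int = 5,
) -> List[Optional[str]]:
    """Place (width, text) cells into columns.

    A cell that is wider than its own column by more than `tolerance`
    is assumed to cover the following column(s) too; those columns
    are filled with None.
    """
    row: List[Optional[str]] = []
    col = 0
    for width, text in cells:
        if col >= len(column_widths):
            break  # more cells than known columns: ignore the rest
        row.append(text.strip() or None)
        remaining = width - column_widths[col]
        col += 1
        # Consume following columns while the cell still has spare width.
        while remaining > tolerance and col < len(column_widths):
            row.append(None)
            remaining -= column_widths[col]
            col += 1
    # Pad rows that end early so every row has the same length.
    row.extend([None] * (len(column_widths) - len(row)))
    return row


# Hypothetical widths for six columns (made up for illustration).
example_widths = [50, 50, 50, 50, 200, 50]
example_cells = [(150, '62.00'), (50, 'PACL'),
                 (200, 'Palaeocene Claystones'), (50, 'SWAP')]
print(align_cells(example_cells, example_widths))
```

In this made-up row the first cell is 150 wide in a column that expects 50, so it is treated as covering the next two columns, which is the same reasoning the script applies with its hard-coded ranges.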

The script may need some refinement, and you should test it against all of the URLs to make sure the data comes out clean.

Here is sample output from running the script:

61.00  SED  TERT  WBDS  NaN  Woolwich Beds  GP  NaN  WLDB                                                                                                                                                                        
62.00  NaN  NaN  PACL  NaN  Palaeocene Claystones  NaN  Nan  SWAP                                                                                                                                                                
63.00  NaN  NaN  SMFC  NaN  Shallow Marine Facies  NaN  Nan  SONS                                                                                                                                                                
64.00  NaN  NaN  DMFC  NaN  Deep Marine Facies  NaN  NaN  NaN                                                                                                                                                                    
65.00  NaN  NaN  SLSY  NaN  Selsey Member  GN  NaN  WSXB                                                                                                                                                                         
66.00  NaN  NaN  MFM  NaN  Marsh Farm Member  NaN  NaN  NaN                                                                                                                                                                      
67.00  NaN  NaN  ERNM  NaN  Earnley Member  NaN  NaN  NaN                                                                                                                                                                        
68.00  NaN  NaN  WITT  NaN  Wittering Member  NaN  NaN  NaN                                                                                                                                                                      
69.00  NaN  NaN  WHI  NaN  Whitecliff Beds  GZ  NaN  NaN                                                                                                                                                                         
70.00  NaN  NaN  Nan  WFSM  NaN  Whitecliff Sand Member  NaN  Nan  GN                                                                                                                                                            
71.00  NaN  WESQ  NaN  Nan  Westray Group Equivalent  NL  GW  WESH                                                                                                                                                               
72.00  NaN  WESR  NaN  Nan  Westray Group  NM  GO  CNSB                                                                                                                                                                          
73.00  NaN  NaN  THEF  NaN  Thet Formation  NaN  Nan  MOFI                                                                                                                                                                       
74.00  NaN  NaN  SKAD  NaN  Skade Formation  NB  NaN  NONS                                                                                                                                                                       
75.00  NaN  NORD  NaN  Nan  Nordland  NP  Q  CNSB                                                                                                                                                                                
75.50  NaN  NaN  SWCH  NaN  Swatchway Formation  Q  NaN  MOFI                                                                                                                                                                    
75.60  NaN  NaN  CLPT  NaN  Coal Pit Formation  NaN  NaN  NaN                                                                                                                                                                    
75.70  NaN  NaN  LNGB  NaN  Ling Bank Formation  NaN  NaN  NaN                                                                                                                                                                   
76.00  NaN  NaN  SHKL  NaN  Shackleton Formation  GO  QP  ROCK                                                                                                                                                                   
77.00  NaN  NaN  UGNS  NaN  Upper Tertiary sands  NaN  NM  NONS                                                                                                                                                                  
78.00  NaN  NaN  CLSD  NaN  Claret Sand  NP  NaN  SVIG                                                                                                                                                                           
79.00  NaN  NaN  BLUE  NaN  Blue Sand  NaN  NaN  NaN                                                                                                                                                                             
80.00  NaN  NaN  ABGF  NaN  Aberdeen Ground Formation  QH  NaN  CNSB                                                                                                                                                             
81.00  NaN  NaN  NUGU  NaN  Upper Glauconitic Unit  NB  NA  MOFI                                                                                                                                                                 
82.00  NaN  NaN  POWD  NaN  Powder Sand  GN  NaN  SVIG                                                                                                                                                                           
83.00  NaN  NaN  BASD  NaN  Basin Sand  NaN  Nan  CNSB                                                                                                                                                                           
84.00  NaN  NaN  CRND  NaN  Crenulate Sand  NaN  NaN  NaN                                                                                                                                                                        
85.00  NaN  NaN  NORS  NaN  Nordland Sand  QP  NaN  SONS                                                                                                                                                                         
86.00  NaN  NaN  MIOS  NaN  Miocene Sand  NM  NaN  ESHB                                                                                                                                                                          
87.00  NaN  NaN  MIOL  NaN  Miocene Limestone  NaN  Nan  CNSB                                                                                                                                                                    
88.00  NaN  NaN  FLSF  NaN  Fladen Sand Formation  GP  GO  WYGG                                                                                                                                                                  

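Note: I don't know how the 0 in the first cell of your example was created, so I left it out of my answer. I'm not sure whether it is supposed to be scraped, because I couldn't find it anywhere on the page.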
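For testing against all of the pages, a loop along the following lines might help. This is a rough sketch under a loud assumption: that every page follows the same naming as the URL in the question, with only the number after `Page` changing, and that there are 60 pages as the question states. The helper names `page_urls` and `collect_rows` are hypothetical, and the download and parsing steps are passed in as functions so the loop itself can be checked offline.

```python
from typing import Callable, Iterable, List, Tuple

# Assumption: only the number after "Page" changes between pages.
BASE_URL = ('https://itportal.ogauthority.co.uk/information/well_data/'
            'lithostratigraphy_hierarchy/rptLithoStrat_1Page{page}.html')


def page_urls(first: int = 1, last: int = 60) -> List[str]:
    """Build the list of page URLs from `first` to `last` inclusive."""
    return [BASE_URL.format(page=n) for n in range(first, last + 1)]


def collect_rows(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], list],
) -> Tuple[list, List[Tuple[str, str]]]:
    """Fetch and parse each URL, keeping track of the pages that failed.

    `fetch` returns the HTML for a URL and `parse` turns that HTML into
    a list of rows; both are supplied by the caller.
    """
    rows: list = []
    failed: List[Tuple[str, str]] = []
    for url in urls:
        try:
            html = fetch(url)
        except Exception as exc:  # record the failure and keep going
            failed.append((url, str(exc)))
            continue
        rows.extend(parse(html))
    return rows, failed
```

With requests, `fetch` could be something like `lambda url: requests.get(url, timeout=30).text`, and `parse` would wrap the width-based row extraction. Recording the failed pages separately makes it easier to see which URLs still need attention.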


0
votes

@samy Thank you very much for this approach to scraping the site.

It needs a few small changes to work on every page. I have applied them to your code below to make sure everything gets scraped.

import requests
import bs4

URL = 'https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html'

dfcolname = ['OrderNo', 'Type', 'Group', 'Formation', 'Member', 'Description', 'Upper Age', 'Lower Age', 'Basin']

response = requests.get(URL)
soup = bs4.BeautifulSoup(response.text, 'lxml')

tables = soup.find_all('table')
count = 0
rlist = []  # every cell value collected so far, in reading order

for table in tables:
    count += 1
    cell_text = []
    # Skip the first two tables and the last one
    if count > 2 and table != tables[-1]:
        row = table.tr
        cells = row.find_all('td')
        print('')
        x = 0
        width_diff = 0
        for cell in cells:
            width = cell.get('width')
            if int(width) < 10:
                continue

            # Small differences (up to 2) are now tolerated
            if width_diff > 2:
                cell_text.append('NaN ')
                if width_diff > 50:
                    x += 2
                    cell_text.append('Nan ')
                else:
                    x += 1
                width_diff = 0

            if x in (0, 1, 2, 3, 4, 6):
                width_range = list(range(35, 51))
            elif x == 5:
                width_range = list(range(220, 231))
            elif x == 7:
                width_range = [136]

            if cell.text:
                cell_text.append(cell.text.strip() + ' ')
            else:
                cell_text.append('NaN ')

            if int(width) not in width_range:
                width_diff = int(width) - width_range[-1]
            x += 1

        # Pad short rows once the whole row has been read
        length = len(cell_text)
        diff = 8 - length
        if diff > 0:
            for j in range(0, diff):
                cell_text.append('NaN ')
        rlist.extend(cell_text)
    print(cell_text)

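Actually, the 0 is the DataFrame's index label. I first tried to save the results in a DataFrame, which is why there was a 0 in my earlier output.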
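To get from the printed lists to something a DataFrame can take, each row could first be turned into a record keyed by column name. The snippet below is a small sketch of that step, assuming the nine column names used in the code above; the helper name `to_record` is hypothetical. It strips the trailing spaces and treats both spellings of the placeholder ('NaN' and 'Nan') as missing.

```python
from typing import Dict, List, Optional, Sequence

COLUMNS = ['OrderNo', 'Type', 'Group', 'Formation', 'Member',
           'Description', 'Upper Age', 'Lower Age', 'Basin']


def to_record(
    cell_text: Sequence[str],
    columns: Sequence[str] = COLUMNS,
) -> Dict[str, Optional[str]]:
    """Map one scraped row onto the column names.

    Placeholder strings become None, short rows are padded with None,
    and any values beyond the last column are dropped.
    """
    values: List[Optional[str]] = []
    for text in cell_text:
        cleaned = text.strip()
        if cleaned == '' or cleaned.lower() == 'nan':
            values.append(None)
        else:
            values.append(cleaned)
    values = values[:len(columns)]
    values += [None] * (len(columns) - len(values))
    return dict(zip(columns, values))


row = ['62.00 ', 'NaN ', 'NaN ', 'PACL ', 'NaN ',
       'Palaeocene Claystones ', 'NaN ', 'Nan ', 'SWAP ']
print(to_record(row))
```

A list of such records can be passed to `pd.DataFrame(records)`, which assigns a default integer index starting at 0; that index is where the leading 0 in the question's output comes from.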
