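I am trying to parse a set of tables that list information about smartphones, for example this link. I only need four specific fields, and getting the fourth one is driving me crazy.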
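The HTML appears to be malformed. Several tables are placed one after another in the page. The first five are fine, but the sixth table ends with </td></tr></table>, closing a <td> and a <tr> that were never opened (or at least I think that is the problem):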
<table cellspacing="0">
<tr>
<th rowspan="5" scope="row">Memory</th>
<td class="ttl"><a href="glossary.php3?term=memory-card-slot">Card slot</a></td>
<td class="nfo" data-spec="memoryslot">microSD, up to 256 GB (uses shared SIM slot)</td></tr>
<tr>
<td class="ttl"><a href="glossary.php3?term=dynamic-memory">Internal</a></td>
<td class="nfo" data-spec="internalmemory">64GB 6GB RAM, 128GB 6GB RAM, 128GB 8GB RAM, 256GB 8GB RAM</td>
</tr>
<tr><td class="ttl"> </td><td class="nfo" data-spec="memoryother">UFS2.1</td></tr>
</td>
</tr>
</table>
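The seventh table is also badly formatted, but I assume that should not be a problem for bs4.
So if I use a CSS selector to get any value from the seventh table through the last one, the selector returns None. In fact, if I use a selector just to get all the tables, it only selects the first six: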
dsoup = BeautifulSoup(dr.content, 'html.parser')
dsel = dsoup.select('#specs-list > table')
print('Found {} tables'.format(len(dsel))) # Prints 6 tables
dsel = dsoup.select_one('#specs-list > table:nth-of-type(10) > tbody > tr:nth-of-type(3) > td.nfo')
print(dsel.text.split('\n')) # dsel is None here, so this raises AttributeError
So the question is: is there a way to parse malformed HTML like this, or is it impossible?
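The diagnosis above (closing tags with no matching opening tag) can be checked without any third-party library. The following is only a rough sketch using the standard-library `html.parser.HTMLParser`; the class name `StrayCloseFinder` and the `VOID` set are hypothetical helpers invented for this illustration, and the matching rule is deliberately simplistic (an end tag counts as stray whenever it does not match the most recently opened tag).

```python
from html.parser import HTMLParser

# Elements that never get a closing tag (small, incomplete list; an assumption
# that is sufficient for this snippet).
VOID = {"br", "img", "input", "hr", "meta", "link"}


class StrayCloseFinder(HTMLParser):
    """Hypothetical helper: records end tags that do not match the open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.stray = []   # (tag, line number) of unmatched end tags

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.stray.append((tag, self.getpos()[0]))


# The sixth table, copied from the question.
snippet = """<table cellspacing="0">
<tr>
<th rowspan="5" scope="row">Memory</th>
<td class="ttl"><a href="glossary.php3?term=memory-card-slot">Card slot</a></td>
<td class="nfo" data-spec="memoryslot">microSD, up to 256 GB (uses shared SIM slot)</td></tr>
<tr>
<td class="ttl"><a href="glossary.php3?term=dynamic-memory">Internal</a></td>
<td class="nfo" data-spec="internalmemory">64GB 6GB RAM, 128GB 6GB RAM, 128GB 8GB RAM, 256GB 8GB RAM</td>
</tr>
<tr><td class="ttl"> </td><td class="nfo" data-spec="memoryother">UFS2.1</td></tr>
</td>
</tr>
</table>"""

finder = StrayCloseFinder()
finder.feed(snippet)
finder.close()

# Each entry is (tag name, line number within the snippet).
print(finder.stray)
```

For this snippet the stray list contains the `td` and `tr` end tags that sit just before `</table>`, which supports the suspicion that the markup itself is broken rather than the selector.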
Don't use 'html.parser'; use 'html5lib' instead. It parses according to (most of) the HTML5 rules:
import requests
from bs4 import BeautifulSoup
url = 'https://www.gsmarena.com/xiaomi_redmi_note_8_pro-9812.php'
soup = BeautifulSoup(requests.get(url).text, 'html5lib')
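# Each <th> holds a section name (Network, Launch, ...).
for th in soup.select('#specs-list th'):
    table = th.find_previous('table')  # the table that contains this <th>
    for ttl in table.select('.ttl'):
        # The value cell is the 'nfo' <td> that follows the 'ttl' <td>.
        nfo = ttl.find_next_sibling('td', {'class': 'nfo'})
        print('{:<20} {:<20} {}'.format(th.text, ttl.text, nfo.get_text(strip=True, separator=' ')))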
Prints:
Network              Technology           GSM / HSPA / LTE
Network              2G bands             GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2
Network              3G bands             HSDPA 850 / 900 / 1900 / 2100
Network              4G bands             LTE band 1(2100), 3(1800), 5(850), 7(2600), 8(900), 40(2300), 41(2500)
Network              Speed                HSPA 42.2/5.76 Mbps, LTE-A
Launch               Announced            2019, August
Launch               Status               Available. Released 2019, September
Body                 Dimensions           161.4 x 76.4 x 8.8 mm (6.35 x 3.01 x 0.35 in)
Body                 Weight               200 g (7.05 oz)
Body                 Build                Front/back glass (Gorilla Glass 5)
Body                 SIM                  Hybrid Dual SIM (Nano-SIM, dual stand-by)
Display              Type                 IPS LCD capacitive touchscreen, 16M colors
Display              Size                 6.53 inches, 104.7 cm 2 (~84.9% screen-to-body ratio)
Display              Resolution           1080 x 2340 pixels, 19.5:9 ratio (~395 ppi density)
Display              Protection           Corning Gorilla Glass 5
Display                                   500 nits max brightness HDR
Platform             OS                   Android 9.0 (Pie); MIUI 10
Platform             Chipset              Mediatek Helio G90T (12nm)
... and so on.
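
If only a handful of fields are needed rather than the whole listing, the `data-spec` attributes visible in the question's snippet may be a more direct handle than positional selectors such as `nth-of-type`. The sketch below is an illustration, not a tested recipe for the live page: it parses an inline, shortened copy of the malformed table instead of fetching the URL, the wrapping `<div id="specs-list">` is assumed from the selectors used above, and the `wanted` list is a made-up choice of fields. It needs the `html5lib` package installed alongside `bs4`.

```python
from bs4 import BeautifulSoup

# Inline, shortened copy of the malformed table from the question
# (stray </td></tr> kept on purpose). The wrapping <div> is an assumption.
html = """<div id="specs-list">
<table cellspacing="0">
<tr>
<th rowspan="5" scope="row">Memory</th>
<td class="ttl"><a href="glossary.php3?term=memory-card-slot">Card slot</a></td>
<td class="nfo" data-spec="memoryslot">microSD, up to 256 GB (uses shared SIM slot)</td></tr>
<tr><td class="ttl"> </td><td class="nfo" data-spec="memoryother">UFS2.1</td></tr>
</td>
</tr>
</table>
</div>"""

soup = BeautifulSoup(html, 'html5lib')

# Hypothetical list of wanted fields; replace with the data-spec names you need.
wanted = ['memoryslot', 'memoryother']

fields = {}
for key in wanted:
    cell = soup.select_one('#specs-list td[data-spec="{}"]'.format(key))
    # Guard against a missing cell instead of calling .get_text() on None.
    fields[key] = cell.get_text(strip=True, separator=' ') if cell else None

print(fields)
```

Selecting by attribute avoids depending on how many tables or rows the parser ends up producing, and the `None` guard avoids the AttributeError that occurs when a selector finds nothing.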