Unable to modify the script's logic to scrape the correct fields from tables of different lengths


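I'm trying to create a script that scrapes some fields from a table based on its headers. The problem is that the tables, and their header rows, are not all the same length. Below are the HTML elements of two tables of different lengths for reference.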

from bs4 import BeautifulSoup

html_table1 = """
<thead>
<tr>
    <td rowspan="1" data-sort-key="survey_date_surrkey">&nbsp;</td>
    <th scope="col" colspan="1" data-sort-key="cell-1"><span>2 Bedroom</span></th>
    <th scope="col" colspan="1" data-sort-key="cell-2"><span>3 Bedroom +</span></th>
    <th scope="col" colspan="1" data-sort-key="cell-3"><span>Total</span></th>
</tr>            
</thead>
<tbody>
<tr>
    <th scope="row" class="first-cell" data-field="survey_date_surrkey" data-value="0">2010 October</th>
    <td class="numericalData" data-field="cell-1" data-value="23">23</td>
    <td class="numericalData" data-field="cell-2" data-value="5">5</td>
    <td class="numericalData" data-field="cell-3" data-value="28">28</td>
</tr>
</tbody>
"""

html_table = """
<thead>
<tr>
    <td rowspan="1" data-sort-key="survey_date_surrkey">&nbsp;</td>
    <th scope="col" colspan="2" data-sort-key="cell-1"><span>Bachelor</span></th>
    <th scope="col" colspan="2" data-sort-key="cell-3"><span>1 Bedroom</span></th>
    <th scope="col" colspan="2" data-sort-key="cell-5"><span>2 Bedroom</span></th>
    <th scope="col" colspan="2" data-sort-key="cell-7"><span>3 Bedroom +</span></th>
    <th scope="col" colspan="2" data-sort-key="cell-9"><span>Total</span></th>
</tr>            
</thead>
<tbody>
<tr>
    <th scope="row" class="first-cell" data-field="survey_date_surrkey" data-value="0">1990 October</th>
    <td class="numericalData" data-field="cell-1" data-value="0.0">0.0</td>
    <td class="" data-field="cell-2" data-value="b ">b </td>
    <td class="numericalData" data-field="cell-3" data-value="0.2">0.2</td>
    <td class="" data-field="cell-4" data-value="a ">a </td>
    <td class="numericalData" data-field="cell-5" data-value="0.0">0.0</td>
    <td class="" data-field="cell-6" data-value="b ">b </td>
    <td class="numericalData" data-field="cell-7" data-value="0.0">0.0</td>
    <td class="" data-field="cell-8" data-value="b ">b </td>
    <td class="numericalData" data-field="cell-9" data-value="0.1">0.1</td>
    <td class="" data-field="cell-10" data-value="a ">a </td>
</tr>
</tbody>
"""

soup = BeautifulSoup(html_table1,'html.parser')

headers = [header.text.strip() for header in soup.select('thead th')]
non_class_presence = [item.text.strip() for item in soup.select("tbody tr td[class='']")]
index_num = len(soup.select("td.numericalData"))

try:
    bachelor = soup.select("td.numericalData")[0].text
except IndexError: bachelor = ""
try:
    one_bed = soup.select("td.numericalData")[1].text
except IndexError: one_bed = ""
try:
    two_bed = soup.select("td.numericalData")[2].text
except IndexError: two_bed = ""
try:
    three_bed = soup.select("td.numericalData")[3].text
except IndexError: three_bed = ""
try:
    total_bed = soup.select("td.numericalData")[4].text
except IndexError: total_bed = ""
print(f"bachelor: {bachelor}\none_bed: {one_bed}\ntwo_bed: {two_bed}\nthree_bed: {three_bed}\ntotal_bed: {total_bed}")

When I run the script with the second table, html_table, I get the following output (correct result):

bachelor: 0.0
one_bed: 0.2
two_bed: 0.0
three_bed: 0.0
total_bed: 0.1

However, when I run the script with html_table1, it produces the wrong result:

bachelor: 23
one_bed: 5
two_bed: 28
three_bed: 
total_bed:

Expected output for html_table1:

bachelor: 
one_bed: 
two_bed: 23
three_bed: 5
total_bed: 28

How can I change the script's logic so that it parses the results correctly regardless of the length of the table and its headers?

python python-3.x web-scraping beautifulsoup
1 Answer

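You are close to the correct result. You just need to select the numeric values more specifically, similar to what you did for non_class_presence, and then zip() the headers with the values: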

headers = [header.get_text(strip=True) for header in soup.select('thead th')]
numeric_values = [item.get_text(strip=True) for item in soup.select("tbody tr td.numericalData")]

dict(zip(headers,numeric_values))

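I replaced .text.strip() with .get_text(strip=True), not because there is anything wrong with the former, but because the latter is closer to the documentation.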
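For the header cells in the question both calls return the same string. They can differ when a cell holds nested markup, because .text.strip() trims only the ends of the joined text while get_text(strip=True) trims each text node before joining. A minimal sketch, using a made-up cell that does not come from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical cell with nested markup and padding around each text node.
cell = BeautifulSoup("<td> a <b> b </b></td>", "html.parser").td

# Trims only the outer ends of the concatenated text.
print(repr(cell.text.strip()))

# Trims every text node first, then joins them with an empty separator.
print(repr(cell.get_text(strip=True)))

# Passing a separator keeps the trimmed pieces apart.
print(repr(cell.get_text(" ", strip=True)))
```

For simple cells like the ones above, either form should work.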


Result for html_table:

{'Bachelor': '0.0',
 '1 Bedroom': '0.2',
 '2 Bedroom': '0.0',
 '3 Bedroom +': '0.0',
 'Total': '0.1'}

Result for html_table1:

{'2 Bedroom': '23', '3 Bedroom +': '5', 'Total': '28'}
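
The dictionaries above simply omit the columns a table does not have, while the expected output in the question shows them as empty strings. One way to get there is to look up each wanted header with dict.get() and a default of "". The following is only a sketch under some assumptions: FIELDS and parse_row are names made up for illustration, the sample is a shortened copy of html_table1, and the table is assumed to have a single body row.

```python
from bs4 import BeautifulSoup

# Hypothetical mapping from the question's variable names to header text.
FIELDS = {
    "bachelor": "Bachelor",
    "one_bed": "1 Bedroom",
    "two_bed": "2 Bedroom",
    "three_bed": "3 Bedroom +",
    "total_bed": "Total",
}

def parse_row(html):
    """Return every field in FIELDS, using "" for columns the table lacks."""
    soup = BeautifulSoup(html, "html.parser")
    headers = [th.get_text(strip=True) for th in soup.select("thead th")]
    values = [td.get_text(strip=True)
              for td in soup.select("tbody tr td.numericalData")]
    by_header = dict(zip(headers, values))
    # dict.get() supplies the empty string when a header is absent.
    return {name: by_header.get(header, "") for name, header in FIELDS.items()}

# Shortened copy of html_table1 from the question.
sample = """
<thead>
<tr>
    <td rowspan="1" data-sort-key="survey_date_surrkey">&nbsp;</td>
    <th scope="col" data-sort-key="cell-1"><span>2 Bedroom</span></th>
    <th scope="col" data-sort-key="cell-2"><span>3 Bedroom +</span></th>
    <th scope="col" data-sort-key="cell-3"><span>Total</span></th>
</tr>
</thead>
<tbody>
<tr>
    <th scope="row" class="first-cell">2010 October</th>
    <td class="numericalData" data-field="cell-1">23</td>
    <td class="numericalData" data-field="cell-2">5</td>
    <td class="numericalData" data-field="cell-3">28</td>
</tr>
</tbody>
"""

for name, value in parse_row(sample).items():
    print(f"{name}: {value}")
```

Because zip() pairs headers and values purely by position, this relies on each header having exactly one numericalData cell in the same order. With several body rows the values would need to be collected per row rather than across the whole tbody.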