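I am trying to parse the contents of all td tags of a given class on a web page, but I want a placeholder value even when the tag itself has no content. For example, the HTML contains td tags like these: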
<td class="odds bdevtt moneylineodds " cfg="">+134</td>
<td class="odds bdevtt moneylineodds " cfg=""></td>
<td class="odds bdevtt moneylineodds " cfg="">-140</td>
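I am trying to get a list like ['+134', '-', '-140'] as output, so that the number of entries equals the number of matching tags, with '-' as the placeholder for an empty tag. However, the following only returns ['+134', '-140'].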
soup.find_all('td', attrs={'class': 'odds bdevtt moneylineodds '})
from bs4 import BeautifulSoup
html = """
<td class="odds bdevtt moneylineodds " cfg="">+134</td>
<td class="odds bdevtt moneylineodds " cfg=""></td>
<td class="odds bdevtt moneylineodds " cfg="">-140</td>
"""
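soup = BeautifulSoup(html, "html.parser")
# Use "-" as a placeholder for cells with no text. The class string has no
# trailing space because BeautifulSoup splits the class value on whitespace.
cells = soup.find_all('td', attrs={'class': 'odds bdevtt moneylineodds'})
odds = ["-" if cell.text == "" else cell.text for cell in cells]
print(odds)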
# output: ['+134', '-', '-140']
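A possible variation on the snippet above, in case some cells contain only whitespace rather than nothing at all: get_text(strip=True) strips the text, and `or` supplies the placeholder when the stripped result is empty. The whitespace-only middle cell in this sample is an assumption made for illustration and is not in the original HTML.

```python
from bs4 import BeautifulSoup

# Sample markup; the middle cell holds only spaces (an assumed variation).
html = """
<td class="odds bdevtt moneylineodds " cfg="">+134</td>
<td class="odds bdevtt moneylineodds " cfg="">   </td>
<td class="odds bdevtt moneylineodds " cfg="">-140</td>
"""
soup = BeautifulSoup(html, "html.parser")

cells = soup.find_all("td", attrs={"class": "odds bdevtt moneylineodds"})
# get_text(strip=True) returns "" for empty or whitespace-only cells,
# so `or "-"` substitutes the placeholder in both cases.
odds = [cell.get_text(strip=True) or "-" for cell in cells]
print(odds)
```

This keeps one entry per matching tag, so the list length still equals the number of td elements found.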
Remove the trailing space from the value of the class attribute and you will get the expected result.
Code:
for elm in soup.find_all('td', attrs={'class': 'odds bdevtt moneylineodds'}):
    print(elm.text)
Output (the empty line comes from the empty td):
+134

-140
The reason becomes visible when you parse the document and print it:
html = """
<td class="odds bdevtt moneylineodds " cfg="">+134</td>
<td class="odds bdevtt moneylineodds " cfg=""></td>
<td class="odds bdevtt moneylineodds " cfg="">-140</td>
"""
soup = BeautifulSoup(html, "html.parser")  # <-- class is split on whitespace, so the trailing space is dropped
print(soup)
Output:
<td cfg="" class="odds bdevtt moneylineodds">+134</td>
<td cfg="" class="odds bdevtt moneylineodds"></td>
<td cfg="" class="odds bdevtt moneylineodds">-140</td>
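As a rough sketch of the same point: after parsing, the class attribute is held as a list of separate tokens, which is why the trailing space disappears. One way to avoid depending on the exact class string is a CSS selector, which matches on individual class tokens. The select() call is not what the answers above use; it is an alternative, and it assumes a Beautiful Soup version with CSS selector support.

```python
from bs4 import BeautifulSoup

html = """
<td class="odds bdevtt moneylineodds " cfg="">+134</td>
<td class="odds bdevtt moneylineodds " cfg=""></td>
<td class="odds bdevtt moneylineodds " cfg="">-140</td>
"""
soup = BeautifulSoup(html, "html.parser")

# The class value is stored as a list of tokens, not as the raw string.
first_classes = soup.td["class"]
print(first_classes)

# A CSS selector matches each class token separately, so whitespace and
# token order in the original attribute value do not matter.
cells = soup.select("td.odds.bdevtt.moneylineodds")
odds = [cell.text or "-" for cell in cells]
print(odds)
```

With the selector, the empty cell is still matched, and the `or "-"` fallback fills in the placeholder.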