BeautifulSoup select results raise KeyError inside a "for i" loop

Question  Votes: 2  Answers: 3

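I'm trying to minimize my code to make it more efficient. However, I got hit by this KeyError truck and I can't tell what went wrong. Please help me out, chief, and point out why my expression is bad. PS: I'm at an amateur level.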

With this code:

recommended = soup.select('table:has(font:contains("推荐主题")), '
                          'table:has(font:contains("版块主题"))')
for item in recommended:
    for i in item.select(".folder:has(a)"):
        print(i)  # each i is a <td class="folder"> element

I get these elements:

<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439293-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439292-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439290-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>

But when I add one more line,

for item in recommended:
    for i in item.select(".folder:has(a)"):
        url_tail = i['href']

I get this KeyError:

    return self.attrs[key]
KeyError: 'href'

What I want to extract is the href links. Thanks, everyone.

python python-3.x web-scraping beautifulsoup keyerror
3 Answers
2
votes

@facelessuser explained the error well (+) and gave what would be my preferred selector. As plan Bs, there appear to be two other attribute = value selector possibilities.

Either:

[href^="thread-"]

or:

[title="新窗口打开"]

Either can be used in a list comprehension, e.g.

links = [item['href'] for item in soup.select('[href^="thread-"]')]

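Your select would probably work off item rather than soup. If the final matches turn out to be too broad, you can always put the parent class in front of the selector:

.folder [title="新窗口打开"]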
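As a minimal sketch of both selectors, assuming markup shaped like the question's output: the `html` string, the extra `other` cell, and the variable names below are made up for illustration, and a Beautiful Soup version that ships with Soup Sieve (4.7 or later) is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the question's output,
# plus one unrelated link to show why narrowing by parent class can help.
html = """
<table><tr>
<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439293-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="other"><a href="forum-1.html" title="新窗口打开">other</a></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# "Starts with" attribute selector: any element whose href begins with "thread-".
by_prefix = [a["href"] for a in soup.select('[href^="thread-"]')]

# Exact attribute match, restricted to descendants of .folder elements.
by_title = [a["href"] for a in soup.select('.folder [title="新窗口打开"]')]

print(by_prefix)
print(by_title)
```

In this sample both lists contain the same two thread links; the link in the `other` cell is skipped by the first selector because of its href and by the second because of its parent class.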

2
votes

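.folder:has(a) selects the td element, because that is the element that has the class folder and contains an a element. It does not select the a element itself; it only checks that the element with .folder has an a element inside it. A td has no href attribute, which is why i['href'] raises KeyError.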

You probably want something like .folder a.
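The difference can be checked with a small sketch, assuming a single-row table modelled on the question's markup (the `html` string and variable names are illustrative only):

```python
from bs4 import BeautifulSoup

# Hypothetical one-cell sample based on the question's output.
html = (
    '<table><tr><td class="folder">'
    '<a href="thread-10439294-1-1.html"><img src="images/green001/folder_new.gif"/></a>'
    '</td></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# :has(a) filters on the contents but still returns the .folder element itself.
td = soup.select_one(".folder:has(a)")
print(td.name)             # td
print("href" in td.attrs)  # False, so td["href"] would raise KeyError

# A descendant combinator returns the <a> element inside .folder.
a = soup.select_one(".folder a")
print(a.name)              # a
print(a["href"])           # thread-10439294-1-1.html
```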


1
votes

You can try it like this.

Since you did not provide the full HTML or the URL, I just tried to retrieve the href values from the HTML text you pasted.

1) Import and create the BeautifulSoup object »

>>> from bs4 import BeautifulSoup
>>> 
>>> html_text = """<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
... <td class="folder"><a href="thread-10439293-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
... <td class="folder"><a href="thread-10439292-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
... <td class="folder"><a href="thread-10439290-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>"""
>>> 
>>> soup = BeautifulSoup(html_text, "html.parser")
>>>
>>> soup
<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439293-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439292-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439290-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
>>> 

2) Find all tds »

>>> tds = soup.find_all("td", class_="folder")
>>> tds
[<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>, <td class="folder"><a href="thread-10439293-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>, <td class="folder"><a href="thread-10439292-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>, <td class="folder"><a href="thread-10439290-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>]
>>> 

3) Inspect (just for testing) »

>>> tds[0]
<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
>>> 
>>> tds[0].a
<a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a>
>>> 
>>> tds[0].a.get("href")
'thread-10439294-1-1.html'
>>> 

4) Finally, retrieve the links (2 ways) »

>>> # Using loop
... 
>>> for td in tds:
...     print(td.a.get("href"))
... 
thread-10439294-1-1.html
thread-10439293-1-1.html
thread-10439292-1-1.html
thread-10439290-1-1.html
>>> 
>>> for td in tds:
...     print(td.a["href"])
... 
thread-10439294-1-1.html
thread-10439293-1-1.html
thread-10439292-1-1.html
thread-10439290-1-1.html
>>> 
>>> 
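The two ways above behave the same on this sample. They differ only when something is missing: td.a is None for a cell without a link, a.get("href") returns None for a missing attribute, and a["href"] raises KeyError. A defensive variant might look like the sketch below; the extra cells in `html` are hypothetical and were added only to exercise those cases.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one normal cell, one <a> without href, one cell without <a>.
html = """<table><tr>
<td class="folder"><a href="thread-10439294-1-1.html">one</a></td>
<td class="folder"><a name="anchor-only">two</a></td>
<td class="folder">no link here</td>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")

links = []
for td in soup.find_all("td", class_="folder"):
    a = td.a              # None when the cell contains no <a>
    if a is None:
        continue
    href = a.get("href")  # None when the attribute is absent
    if href is not None:
        links.append(href)

print(links)
```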