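I'm trying to trim my code down to make it more efficient, but I got hit by this KeyError truck and I have no idea what went wrong. Please help me out, chief, and point out why my expression is wrong. PS: I'm at an amateur level.
With this code:
recommended = soup.select('table:has(font:contains("推荐主题")), '
                          'table:has(font:contains("版块主题"))')
for item in recommended:
    for i in item.select(".folder:has(a)"):
        print(i)
I get the following DOM: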
<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439293-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439292-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439290-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
But when I add one more line,
for item in recommended:
    for i in item.select(".folder:has(a)"):
        url_tail = i['href']
I get this KeyError:
return self.attrs[key]
KeyError: 'href'
What I want to pull out is the href links. Thanks, everyone.
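@facelessuser explained the error well (+) and gave what would be my preferred selector. As plan Bs, there are two other attribute = value selector possibilities.
Either:
[href^="thread-"]
Or:
[title="新窗口打开"]
Either can be used in a list comprehension, e.g.
links = [item['href'] for item in soup.select('[href^="thread-"]')]
In your code the select would probably be called on item rather than soup. If the match ends up too broad, scope it to the cells with .folder [title="新窗口打开"].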
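As a rough illustration of how the attribute selectors above behave, here is a minimal sketch. The sample markup is an assumption: one row trimmed from the question plus a made-up, unrelated link (index.html) that shares the same title, just to show when scoping with .folder matters.

```python
from bs4 import BeautifulSoup

# Assumed sample: one row from the question plus an unrelated link that
# happens to carry the same title attribute.
html = """
<a href="index.html" title="新窗口打开">home</a>
<table><tr>
<td class="folder"><a href="thread-10439294-1-1.html" title="新窗口打开">x</a></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# The bare attribute selector matches every element carrying that title.
broad = [a["href"] for a in soup.select('[title="新窗口打开"]')]

# Prefixing .folder restricts matches to descendants of .folder elements.
narrow = [a["href"] for a in soup.select('.folder [title="新窗口打开"]')]

# The prefix match on href only picks up links that start with "thread-".
by_prefix = [a["href"] for a in soup.select('[href^="thread-"]')]

print(broad)      # ['index.html', 'thread-10439294-1-1.html']
print(narrow)     # ['thread-10439294-1-1.html']
print(by_prefix)  # ['thread-10439294-1-1.html']
```

Which variant is appropriate depends on the real page, which I can't see; the prefix match is the least dependent on surrounding layout.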
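.folder:has(a) selects the td element, because that is the element with the class folder that has a descendant a. It does not select the a element; it only checks whether the element with .folder contains an a element. Since the td itself has no href attribute, i['href'] raises the KeyError.
You probably want something like .folder a.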
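To make the difference concrete, here is a minimal sketch using two rows trimmed from the markup in the question. The table/tr wrapper and the variable names are assumptions added for illustration, and it assumes a Beautiful Soup version whose select() supports :has() (4.7 or later, via soupsieve).

```python
from bs4 import BeautifulSoup

# Two rows trimmed from the question's markup; the <table><tr> wrapper is assumed.
html = """
<table><tr>
<td class="folder"><a href="thread-10439294-1-1.html" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439293-1-1.html" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# :has(a) filters the <td> elements but still returns the <td> itself.
cells = soup.select(".folder:has(a)")
print([cell.name for cell in cells])   # ['td', 'td']
print("href" in cells[0].attrs)        # False, which is why cells[0]['href'] fails

# The descendant combinator returns the <a> elements instead.
anchors = soup.select(".folder a")
hrefs = [a["href"] for a in anchors]
print(hrefs)  # ['thread-10439294-1-1.html', 'thread-10439293-1-1.html']
```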
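You can try it like this.
Since you didn't provide the full HTML or the URL, I only tried to retrieve the href values from the HTML text you pasted.
1) Import and create the BeautifulSoup object »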
>>> from bs4 import BeautifulSoup
>>>
>>> html_text = """<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
... <td class="folder"><a href="thread-10439293-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
... <td class="folder"><a href="thread-10439292-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
... <td class="folder"><a href="thread-10439290-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>"""
>>>
>>> soup = BeautifulSoup(html_text, "html.parser")
>>>
>>> soup
<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439293-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439292-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439290-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
>>>
2) Find all the tds »
>>> tds = soup.find_all("td", class_="folder")
>>> tds
[<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>, <td class="folder"><a href="thread-10439293-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>, <td class="folder"><a href="thread-10439292-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>, <td class="folder"><a href="thread-10439290-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>]
>>>
3) Inspect (just for testing) »
>>> tds[0]
<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
>>>
>>> tds[0].a
<a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a>
>>>
>>> tds[0].a.get("href")
'thread-10439294-1-1.html'
>>>
4) Finally, retrieve the links (2 ways) »
>>> # Using loop
...
>>> for td in tds:
... print(td.a.get("href"))
...
thread-10439294-1-1.html
thread-10439293-1-1.html
thread-10439292-1-1.html
thread-10439290-1-1.html
>>>
>>> for td in tds:
... print(td.a["href"])
...
thread-10439294-1-1.html
thread-10439293-1-1.html
thread-10439292-1-1.html
thread-10439290-1-1.html
>>>
>>>
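The two ways above differ when something is missing: td.a is None if a cell has no link, a.get("href") returns None if the attribute is absent, and a["href"] raises KeyError in that case. A defensive version might look like the sketch below; the sample markup with an empty cell and an anchor without href is made up for illustration and is not from the question's page.

```python
from bs4 import BeautifulSoup

# Assumed sample: the second cell has no link, the third link has no href.
html = """
<table><tr>
<td class="folder"><a href="thread-10439294-1-1.html">x</a></td>
<td class="folder"></td>
<td class="folder"><a name="anchor-only">y</a></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for td in soup.find_all("td", class_="folder"):
    a = td.a                # None when the cell contains no <a>
    if a is None:
        continue
    href = a.get("href")    # None when missing; a["href"] would raise KeyError
    if href is not None:
        links.append(href)

print(links)  # ['thread-10439294-1-1.html']
```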