Python + BeautifulSoup: How to get the 'href' attribute of an 'a' element?

Question

I have the following:

  html = '''<div class="file-one">
    <a href="/file-one/additional" class="file-link">
      <h3 class="file-name">File One</h3>
    </a>
    <div class="location">
      Down
    </div>
  </div>'''

and I want to get the href value, /file-one/additional. So I did:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

link_text = ""

for a in soup.find_all('a', href=True, text=True):
    link_text = a['href']

print("Link: " + link_text)

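But it just prints a blank, nothing at all. Just Link:. So I tested it on another site, with different HTML, and it worked.

What could I be doing wrong? Or is it possible that the site was intentionally programmed not to return the href?

Thanks in advance, I will definitely upvote/accept an answer!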

Answer 1

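The 'a' tag in your html does not contain any text directly; it contains an 'h3' tag that has the text. This means the tag's .string is None, so with text=True .find_all() fails to select the tag. In general, do not use the text argument when a tag contains any other html elements besides text content.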
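To make the explanation above concrete, here is a minimal sketch using a trimmed copy of the question's HTML. It uses `string=True`, which is the newer name for the `text` argument in Beautiful Soup 4.4.0 and later; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')

# The anchor has three children (whitespace, <h3>, whitespace),
# so .string, which the text/string filter is checked against, is None.
a_string = a.string

# get_text() walks all descendants, so it does find the nested text.
a_text = a.get_text(strip=True)

# With the string filter the anchor is rejected; without it, it is found.
with_text_filter = soup.find_all('a', href=True, string=True)
without_text_filter = soup.find_all('a', href=True)

print(a_string, repr(a_text), with_text_filter, without_text_filter)
```

The filter only looks at the tag's own .string, which is why nested text in the 'h3' does not count.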

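You can solve this by selecting the elements using only the tag name (and the href keyword argument), then adding a condition in the loop to check whether they contain text.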

soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True): 
    if a.text: 
        links_with_text.append(a['href'])

Or you could use a list comprehension, if you prefer one-liners.

links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]

Or you could pass a lambda to .find_all().

tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
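One caveat with the `a.text` check: a tag whose only content is whitespace still has a truthy `.text`. If that matters, `get_text(strip=True)` is one possible stricter test. The sketch below uses a made-up `sample` document (not from the question) to show the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one anchor with nested text, one with only
# whitespace inside, and one that is completely empty.
sample = '''
<a href="/with-text"><h3>File One</h3></a>
<a href="/whitespace-only">
</a>
<a href="/empty"></a>
'''

soup = BeautifulSoup(sample, 'html.parser')

# a.text is "\n" for the whitespace-only anchor, which is truthy,
# so that anchor passes the loose check.
loose = [a['href'] for a in soup.find_all('a', href=True) if a.text]

# get_text(strip=True) strips every text node and skips empty ones,
# so whitespace-only anchors are dropped.
strict = [a['href'] for a in soup.find_all('a', href=True)
          if a.get_text(strip=True)]

print(loose)
print(strict)
```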

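If you want to collect all links whether they have text or not, just select all 'a' tags that have an 'href' attribute. Anchor tags usually have links, but that is not a requirement, so I think it is best to use the href argument.

Using .find_all().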

links = [a['href'] for a in soup.find_all('a', href=True)]

Using .select() with a CSS selector.

links = [a['href'] for a in soup.select('a[href]')]

Answer 2

First, use a different text editor, one that does not insert curly quotes.
Second, remove the text=True flag from soup.find_all.
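Applied to the code in the question, those two changes might look like the sketch below: straight quotes throughout, no text=True, and print written as a function call so it also runs on Python 3.

```python
from bs4 import BeautifulSoup

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
  <div class="location">
    Down
  </div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

link_text = ""

# Select every anchor that has an href, regardless of its text content.
for a in soup.find_all('a', href=True):
    link_text = a['href']

print("Link: " + link_text)
```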

Answer 3

You can also use attrs to get the href attribute, combined with a regular expression search:

import re

soup.find('a', href = re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
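Note that .find() returns None when nothing matches, so chaining .attrs onto it can raise an AttributeError. A hedged sketch with a guard follows; the pattern is a simplified, anchored variant of the one above, and the second anchor (`#top`) is an invented example, not part of the question.

```python
import re
from bs4 import BeautifulSoup

html = ('<div><a href="/file-one/additional" class="file-link">'
        '<h3>File One</h3></a><a href="#top">Top</a></div>')
soup = BeautifulSoup(html, 'html.parser')

# Match hrefs that start with a slash followed by a letter.
pattern = re.compile(r'^/[A-Za-z]\w+')

tag = soup.find('a', href=pattern)
# Guard against a missing match before reading the attribute.
href = tag['href'] if tag is not None else None

# A pattern that matches no href returns None instead of a tag.
missing = soup.find('a', href=re.compile(r'^/no-such-path'))

print(href, missing)
```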
