How do I select specific div or paragraph tags from HTML content using Beautiful Soup?

Question

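I am using Beautiful Soup to extract text content from HTML data. I have a div containing several paragraph tags, and the last paragraph is a copyright notice with the copyright symbol, a year, and some further information. The year changes depending on when the content was published, so I cannot search for the exact text, but apart from the year the rest of the notice is always the same.

Is there a way to remove or ignore the last paragraph?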

from bs4 import BeautifulSoup

text_content = '<div><p>here is the header information </p><p> some text content </p> <p> another block of text</p> .....<p> 2024 copyright , all rights reserved </p>'

bs = BeautifulSoup(text_content, "html.parser")

only_text = " ".join([p.text for p in bs.find_all("p")])

I used Beautiful Soup to get all the text content, and now I want to drop that specific paragraph.

Answer 1

You can use the slice operator ([:-1]) to drop the last item from the list returned by find_all:

only_text = " ".join([p.text for p in bs.find_all("p")[:-1] ])

So the complete code becomes:

from bs4 import BeautifulSoup

text_content = '<div><p>here is the header information </p><p> some text content </p> <p> another block of text</p> .....<p> 2024 copyright , all rights reserved </p>'

bs = BeautifulSoup(text_content, "html.parser")

only_text = " ".join([p.text for p in bs.find_all("p")[:-1] ])

print(only_text)

Output:

here is the header information   some text content   another block of text
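
Slicing assumes the copyright notice is always the last paragraph. Since the question says the wording is fixed apart from the year, another option is to filter by content instead of position. The following is a minimal sketch under that assumption: the regular expression and the name copyright_re are illustrative and not part of the original question, and the sample HTML is a simplified, closed version of the one above.

```python
import re
from bs4 import BeautifulSoup

# Simplified sample modelled on the question's HTML (assumption: well-formed div).
text_content = (
    "<div><p>here is the header information </p><p> some text content </p> "
    "<p> another block of text</p>"
    "<p> 2024 copyright , all rights reserved </p></div>"
)

bs = BeautifulSoup(text_content, "html.parser")

# Hypothetical pattern: any four-digit year followed by the fixed footer wording.
copyright_re = re.compile(
    r"\d{4}\s+copyright\s*,\s*all rights reserved", re.IGNORECASE
)

# Keep every paragraph whose text does not match the copyright pattern.
paragraphs = [
    p.get_text(strip=True)
    for p in bs.find_all("p")
    if not copyright_re.search(p.get_text())
]

only_text = " ".join(paragraphs)
print(only_text)
```

This version also uses get_text(strip=True), so the extra spaces around each paragraph are removed before joining. If the real notice is worded differently, the pattern would need to be adjusted to match it.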
