The HTML looks like this:
<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>
I am trying to get all the divs and convert them to strings:
divs = [str(i) for i in soup.find_all('div')]
However, the strings also include the children:
>>> ["<div name='tag-i-want'><span>I don't want this</span></div>"]
What I want is:
>>> ["<div name='tag-i-want'></div>"]
I thought unwrap() would return this, but it also modifies the soup; I want the soup to stay as it is.
With clear() you can remove the contents of a tag.
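As a minimal sketch of why clear() on its own is not enough here: calling it on the tag returned by find() empties that tag inside the soup itself. The html.parser backend and the variable names are chosen only for illustration.

```python
from bs4 import BeautifulSoup

html = "<body><div name='tag-i-want'><span>I don't want this</span></div></body>"
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div")
div.clear()  # removes the children in place, so the soup itself is changed

div_string = str(div)
span_still_in_soup = soup.find("span") is not None

print(div_string)          # <div name="tag-i-want"></div>
print(span_still_in_soup)  # False
```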
To leave the soup unchanged, you can either make an independent copy of the tag with copy or use a DIY approach. Here is an example with copy:
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div')
div_only = copy.copy(div)
div_only.clear()
print(div_only)
print(soup.find_all('span') != [])
Output:
<div name="tag-i-want"></div>
True
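The example above relies on copy.copy returning a tag that is detached from the original tree. A small check of that behaviour, assuming html.parser as the backend and with variable names picked only for this sketch:

```python
import copy
from bs4 import BeautifulSoup

html = "<body><div name='tag-i-want'><span>I don't want this</span></div></body>"
soup = BeautifulSoup(html, "html.parser")
before = str(soup)  # snapshot of the soup before touching anything

div = soup.find("div")
div_copy = copy.copy(div)

# The copy is not attached to the soup, so it has no parent.
copy_is_detached = div_copy.parent is None

# Clearing the copy leaves the original tree as it was.
div_copy.clear()
soup_unchanged = str(soup) == before

print(copy_is_detached, soup_unchanged)
```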
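Remarks: DIY approaches that do not use copy, either with the Tag class or with plain string formatting:
from bs4 import BeautifulSoup, Tag
...
# Option 1: build a new, empty tag with the Tag class
div_only = Tag(name='div', attrs=div.attrs)
# Option 2: build the string directly from the attributes
div_only = '<div {}></div>'.format(' '.join(f'{k}="{v}"' for k, v in div.attrs.items()))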
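One caveat with the string-formatting route: multi-valued attributes such as class are stored as lists, and attribute values are not escaped. The following is only a sketch of one way to handle both; empty_tag_string is a hypothetical helper, not part of bs4.

```python
from html import escape
from bs4 import BeautifulSoup

def empty_tag_string(tag):
    """Hypothetical helper: render a tag with its attributes but no children."""
    parts = []
    for key, value in tag.attrs.items():
        if isinstance(value, list):
            # Multi-valued attributes (e.g. class) are stored as lists.
            value = " ".join(value)
        parts.append(f'{key}="{escape(str(value), quote=True)}"')
    attr_text = (" " + " ".join(parts)) if parts else ""
    return f"<{tag.name}{attr_text}></{tag.name}>"

soup = BeautifulSoup(
    "<div class='a b' name='tag-i-want'><span>x</span></div>", "html.parser"
)
result = empty_tag_string(soup.div)
print(result)  # <div class="a b" name="tag-i-want"></div>
```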
@cards pointed me in the right direction with copy(). This is what I ended up using:
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
def remove_children(tag):
    tag.clear()
    return tag
divs = [str(remove_children(copy.copy(i))) for i in soup.find_all('div')]
This is how I solved it:
from bs4 import BeautifulSoup, Tag
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'html.parser')
# print the first div without its children:
print(Tag(name=soup.div.name, attrs=soup.div.attrs))
# print all divs without their children:
for i in soup.find_all("div"):
    print(Tag(name=i.name, attrs=i.attrs))
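This way you can write it as a one-liner. If you use copy.copy and clear, you are doing more work than you need to (if all you want is to print the tag and its attributes).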
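For the HTML in the question, the two approaches should give the same strings. A quick comparison, as a sketch only: via_copy and via_new_tag are illustrative names, and this checks a plain div, so other kinds of tags are not covered and may behave differently.

```python
import copy
from bs4 import BeautifulSoup, Tag

html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, "html.parser")

def via_copy(tag):
    # Copy the tag (children included), then drop the children from the copy.
    clone = copy.copy(tag)
    clone.clear()
    return str(clone)

def via_new_tag(tag):
    # Build a fresh tag from the name and attributes only.
    return str(Tag(name=tag.name, attrs=tag.attrs))

with_copy = [via_copy(d) for d in soup.find_all("div")]
with_tag = [via_new_tag(d) for d in soup.find_all("div")]

print(with_copy)
print(with_tag)
```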