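I'm using BeautifulSoup to extract the content between two specific HTML tags. The tags have no particular attributes or ids, and I want everything between the first and second occurrence of the tag. For example, given the following HTML: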
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
The output I want is:
Text <i>here</i> has no tag
<div>This is in a div</div>
However, if the tags have different id attributes, for example:
<h1 id="start"></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1 id="end"></h1>
I still want to extract the same content, but using the specific id attributes as the boundaries.
Code:
Here is the Python code I wrote with BeautifulSoup to do this:
from bs4 import BeautifulSoup

html_doc = '''
This I <b>don't</b> want
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
This I <b>don't</b> want too
'''

# Parse the HTML
soup = BeautifulSoup(html_doc, 'html.parser')

# Find the first and second <h1> tags
h1_tags = soup.find_all('h1')

if len(h1_tags) >= 2:
    # Extract content between the two <h1> tags
    between_tags = []
    for element in h1_tags[0].next_siblings:
        # Compare by identity: == on tags compares name, attributes and contents
        if element is h1_tags[1]:
            break
        between_tags.append(str(element))
    # Join and print the result
    print(''.join(between_tags).strip())
Prints:
Text <i>here</i> has no tag
<div>This is in a div</div>
This works without ids.
If the h1 tags have different ids:
html_doc = '''
This I <b>don't</b> want
<h1 id="start"></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1 id="end"></h1>
This I <b>don't</b> want too
'''
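For the id case, the same sibling walk can probably be reused with the boundaries located by id. The following is only a minimal sketch, not code from the original answer: the helper name content_between_ids is hypothetical, and it assumes both boundary tags are siblings under the same parent.

```python
from bs4 import BeautifulSoup

html_doc = '''
This I <b>don't</b> want
<h1 id="start"></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1 id="end"></h1>
This I <b>don't</b> want too
'''


def content_between_ids(markup, start_id, end_id):
    """Return the HTML between two sibling tags found by id (hypothetical helper)."""
    soup = BeautifulSoup(markup, 'html.parser')
    start = soup.find(id=start_id)
    end = soup.find(id=end_id)
    if start is None or end is None:
        # One of the boundaries is missing, so there is nothing to extract
        return ''
    parts = []
    for element in start.next_siblings:
        # Stop at the end boundary itself (identity check, not equality)
        if element is end:
            break
        parts.append(str(element))
    return ''.join(parts).strip()


result = content_between_ids(html_doc, 'start', 'end')
print(result)
```

If the end tag is not a sibling of the start tag, this sketch collects everything up to the end of the parent, so a stricter version might want to check end.parent first.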
The parent attribute and the decompose method may help you here.
# 1. Find the first item you are looking for.
soup = BeautifulSoup(html_doc, 'html.parser')
hElem = soup.find('h1', {'id': 'start'})
# 2. Find the second condition.
endElem = soup.find('h1', {'id': 'end'})
# 3. Get parent element that contains both.
hParent = hElem.parent  # Can be made more complex if multiple ancestors are needed to contain both conditions.
# 4. Iterate over a copy of the children and remove everything outside the conditions.
inBetween = False
for child in list(hParent.children):
    if child is endElem:
        inBetween = False
    if not inBetween:
        # extract() works on tags and text nodes alike; decompose() exists only on tags
        child.extract()
    if child is hElem:
        inBetween = True
# Remaining data.
print(hParent.decode_contents().strip())
You could iterate over the soup contents yourself and build up the blocks of elements seen between each <h1> tag:
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<h1 id = '1' ></h1>
Text1 <i>here</i> has no tag
<div>This is in a div</div>
<h1 id = '2' ></h1>
Text2 <i>here</i> has no tag
<div>This is in a div</div>
<h1 id = '3' ></h1>
Text3 <i>here</i> has no tag
"""

soup = BeautifulSoup(html, "html.parser")
block = []
blocks = []
h1 = False

for el in soup.contents:
    if isinstance(el, Tag) and el.name == 'h1':
        # Has a h1 tag been seen yet?
        if h1:
            blocks.append(block)
            block = []
        h1 = True
    elif h1:
        block.append(el)

# Add any final elements (missing a next h1)
if block:
    blocks.append(block)

# Display each block as HTML, without reassigning soup.contents
for b in blocks:
    print(''.join(str(el) for el in b).strip())
    print("--------------")
For this example there are 3 such blocks of elements:
Text1 <i>here</i> has no tag
<div>This is in a div</div>
--------------
Text2 <i>here</i> has no tag
<div>This is in a div</div>
--------------
Text3 <i>here</i> has no tag
--------------
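If each block needs to be looked up later, one possible variation is to key the blocks by the id of the heading that opens them. This is a sketch under assumptions rather than part of the answer above: the function name blocks_by_heading_id is hypothetical, the headings are assumed to be top-level nodes with unique ids, and the sample markup is the same as above with the attribute spacing normalised.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<h1 id="1"></h1>
Text1 <i>here</i> has no tag
<div>This is in a div</div>
<h1 id="2"></h1>
Text2 <i>here</i> has no tag
<div>This is in a div</div>
<h1 id="3"></h1>
Text3 <i>here</i> has no tag
"""


def blocks_by_heading_id(markup, tag_name='h1'):
    """Map each heading id to the HTML that follows it (hypothetical helper)."""
    soup = BeautifulSoup(markup, 'html.parser')
    blocks = {}
    current = None  # list collecting the block that is currently open
    for el in soup.contents:
        if isinstance(el, Tag) and el.name == tag_name:
            # A new heading opens a new block keyed by its id
            current = blocks.setdefault(el.get('id'), [])
        elif current is not None:
            current.append(str(el))
    # Join each block into a single HTML string
    return {key: ''.join(parts).strip() for key, parts in blocks.items()}


result = blocks_by_heading_id(html)
for heading_id, block_html in result.items():
    print(heading_id, '->', repr(block_html))
```

Headings without an id would all share the key None in this sketch, so real input may need a different key such as the heading's position.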