Extracting data between two h1 tags using BeautifulSoup

Question (votes: 0, answers: 2)

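Beautiful Soup: extract everything between two tags

I am using BeautifulSoup to extract the content between two specific HTML tags. The tags have no particular attributes or IDs, and I want everything between the first and second occurrence of the tag. For example, given the following HTML: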

    <h1></h1>
    Text <i>here</i> has no tag
    <div>This is in a div</div>
    <h1></h1>

The output I want is:

    Text <i>here</i> has no tag
    <div>This is in a div</div>

However, if the tags have different id attributes, for example:

<h1 id="start"></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1 id="end"></h1>

I still want to extract the same content, but using the specific id attributes as the boundaries.

Code:

This is the Python code I wrote with BeautifulSoup to do this:

from bs4 import BeautifulSoup

html_doc = '''
This I <b>don't</b> want
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
This I <b>don't</b> want too
'''

# Parse the HTML
soup = BeautifulSoup(html_doc, 'html.parser')

# Find the first and second <h1> tags
h1_tags = soup.find_all('h1')

if len(h1_tags) >= 2:
    # Extract content between the two <h1> tags
    between_tags = []
    for element in h1_tags[0].next_siblings:
        if element is h1_tags[1]:  # identity check; == compares tags by their markup
            break
        between_tags.append(str(element))

    # Join and print the result
    print(''.join(between_tags).strip())

This prints:

    Text <i>here</i> has no tag
    <div>This is in a div</div>

This works when the tags have no ids.

What if the h1 tags have different ids?

html_doc = '''
This I <b>don't</b> want
<h1 id="start"></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1 id="end"></h1>
This I <b>don't</b> want too
'''
python html beautifulsoup html-tag-summary
2 Answers
2
votes

The parent attribute and the decompose() method may help you here.

from bs4 import BeautifulSoup

# 1. Find the first boundary element (html_doc is the string from the question).

soup = BeautifulSoup(html_doc, 'html.parser')
hElem = soup.find('h1', {'id': 'start'})


# 2. Find the second boundary element.

endElem = soup.find('h1', {'id': 'end'})


# 3. Get the parent element that contains both.

hParent = hElem.parent  # Walk further up if the two boundaries do not share a direct parent.


# 4. Iterate over a copy of the children and remove everything outside the boundaries.
#    extract() is used instead of decompose() because text nodes have no decompose().

inBetween = False
for child in list(hParent.children):
    if child is endElem:
        inBetween = False
    if not inBetween:
        child.extract()
    if child is hElem:
        inBetween = True

# Remaining data.
print(''.join(str(child) for child in hParent.children).strip())
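
If you would rather leave the parsed tree untouched, a non-destructive variant is to walk the siblings of the start tag until the end tag is reached. The following is only a sketch: the helper name content_between_ids is made up for illustration, and it assumes both h1 tags are siblings under the same parent, as in the question's HTML.

```python
from bs4 import BeautifulSoup

html_doc = '''
This I <b>don't</b> want
<h1 id="start"></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1 id="end"></h1>
This I <b>don't</b> want too
'''


def content_between_ids(html, start_id, end_id):
    """Hypothetical helper: return the markup between two sibling h1 tags."""
    soup = BeautifulSoup(html, 'html.parser')
    start = soup.find('h1', id=start_id)
    end = soup.find('h1', id=end_id)
    if start is None or end is None:
        return None  # one of the boundaries is missing

    parts = []
    for node in start.next_siblings:
        if node is end:  # identity check: stop at the end boundary itself
            break
        parts.append(str(node))
    return ''.join(parts).strip()


print(content_between_ids(html_doc, 'start', 'end'))
```

If the end tag is not a sibling of the start tag, this loop simply collects everything up to the end of the parent, so a nested layout would need a different traversal.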

0
votes

You can iterate over the soup's contents yourself and build up the blocks of elements seen between each <h1> tag:

from bs4 import BeautifulSoup
from bs4.element import Tag


html = """
<h1 id = '1' ></h1>
Text1 <i>here</i> has no tag
<div>This is in a div</div>
<h1 id = '2' ></h1>
Text2 <i>here</i> has no tag
<div>This is in a div</div>
<h1 id = '3' ></h1>
Text3 <i>here</i> has no tag
"""

soup = BeautifulSoup(html, "html.parser")

block = []
blocks = []
h1 = False

for el in soup.contents:
    if isinstance(el, Tag) and el.name == 'h1':
        # Has a h1 tag been seen yet?
        if h1:
            blocks.append(block)
            block = []
        h1 = True
    elif h1:
        block.append(el)

# Add any final elements (missing a next h1)
if block:
    blocks.append(block)
        
# Display each block as HTML without overwriting soup.contents
for b in blocks:
    print(''.join(str(el) for el in b))
    print("--------------")


This example has 3 such blocks of elements:

Text1 <i>here</i> has no tag
<div>This is in a div</div>

--------------

Text2 <i>here</i> has no tag
<div>This is in a div</div>

--------------

Text3 <i>here</i> has no tag

--------------
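
If you want to look blocks up by the id of the heading that precedes them, the same loop can fill a dictionary instead of a list. This is a rough sketch rather than a definitive recipe: blocks_by_id is a name invented here, and it assumes every h1 is a top-level node that carries an id attribute.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<h1 id="1"></h1>
Text1 <i>here</i> has no tag
<div>This is in a div</div>
<h1 id="2"></h1>
Text2 <i>here</i> has no tag
"""

soup = BeautifulSoup(html, "html.parser")

blocks_by_id = {}   # hypothetical name: maps an h1 id to the markup after it
current_id = None   # id of the most recent h1, or None before the first one

for el in soup.contents:
    if isinstance(el, Tag) and el.name == 'h1':
        # A new heading starts a new, initially empty block
        current_id = el.get('id')
        blocks_by_id[current_id] = ''
    elif current_id is not None:
        # Anything after a heading belongs to that heading's block
        blocks_by_id[current_id] += str(el)

for h1_id, markup in blocks_by_id.items():
    print(h1_id, repr(markup.strip()))
```

With ids as boundaries, the content between "start" and "end" in the question would then be a single dictionary lookup on the start id.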
        
    