How do I get multiple occurrences in HTML (finding sections)?


I have an HTML string:

<section id="post-2603507" class="caja-anuncio ">
<section class="foto"> ... OMITTED...
    </section> 
<section class="caja-datos">
 ... OMITTED...
     </h5>

 </section>
</section>
          
 ... OMITTED...  



   <section id="post-1149642" class="caja-anuncio ">
<section class="foto"> ... OMITTED...
    </section> 
<section class="caja-datos">
 ... OMITTED...
     </h5>

 </section>
</section>

How can I capture each occurrence as its own group (from the first node to the last)?

Each section looks like this:

 <section id="post-1149642" class="caja-anuncio ">
    <section class="foto"> ... OMITTED...
        </section> 
    <section class="caja-datos">
     ... OMITTED...
         </h5>
    
     </section>
    </section>

I tried these regular expressions:

<section id="post-\d+" class="caja-anuncio\s*">((?s).*) 
<section id="post-\d+" class="caja-anuncio\s*">((?s).*)<\/section>\s*<\/section>\s* 
<section id="post-\d+" class="caja-anuncio\s*">((?s).*)<\/section>\s*<\/section>\s*<section

Any suggestions?

html regex sections
1 Answer

You should not use regular expressions to parse HTML.
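To see why the patterns in the question over-match, here is a minimal PowerShell sketch. The sample markup and the names `$sample`, `$open`, `$close`, `$greedy` and `$lazy` are made up for illustration and are not the asker's real data. The greedy `(?s).*` runs to the end of the input and backtracks only as far as the last closing pair, so every post lands in a single match. A lazy `.*?` happens to split this simplified sample per post, but it would cut a post short as soon as its body contained two adjacent closing tags of its own, which is one reason a parser is the safer route.

```powershell
# Hypothetical, simplified sample: two posts shaped like the ones in the question.
$sample = @'
<section id="post-1" class="caja-anuncio ">
<section class="foto">A</section>
<section class="caja-datos">B</section>
</section>
<section id="post-2" class="caja-anuncio ">
<section class="foto">C</section>
<section class="caja-datos">D</section>
</section>
'@

# Opening tag and closing pair, taken from the patterns in the question.
$open  = '<section id="post-\d+" class="caja-anuncio\s*">'
$close = '</section>\s*</section>'

# Greedy: .* consumes as much as possible, so the match ends at the LAST closing pair.
$greedy = [regex]::Matches($sample, '(?s)' + $open + '.*' + $close)

# Lazy: .*? consumes as little as possible, so each match ends at the FIRST closing pair.
$lazy = [regex]::Matches($sample, '(?s)' + $open + '.*?' + $close)

"Greedy matches: $($greedy.Count)"
"Lazy matches:   $($lazy.Count)"
```

On this sample the greedy pattern yields one match that spans both posts, while the lazy pattern yields two.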

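That said, parsing it is not trivial either: the class attribute values come back with a trailing space, so they need to be trimmed before comparing.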

# HTML content - I inline it but you can read from a file - I added a missing <h5>
$htmlContent = @'
<section id="post-2603507" class="caja-anuncio ">
<section class="foto"> ... OMITTED...   </section> 
<section class="caja-datos">
 ... OMITTED...
     <h5></h5>

 </section>
</section>
          


   <section id="post-1149642" class="caja-anuncio ">
<section class="foto"> ... OMITTED...    </section> 
<section class="caja-datos">
 ... OMITTED...
     <h5></h5>

 </section>
</section>
'@

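# Convert HTML to XML by wrapping it in a root node (this only works if the markup is well-formed XML)
try {
  [xml]$xmlContent = [xml]("<root>" + $htmlContent + "</root>")
  Write-Host "XML Content:" $xmlContent.OuterXml
} catch {
  Write-Host "Error parsing XML:" $_.Exception.Message
  return  # stop here: the rest of the script needs a parsed document
}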

# Debug to see we successfully parsed the HTML
Write-Host "Root Element:" $xmlContent.DocumentElement.Name
Write-Host "First-Level Children:" 
$xmlContent.DocumentElement.ChildNodes | ForEach-Object { $_.OuterXml }

# Loop the root sections
foreach ($node in $xmlContent.root.section) {
    Write-Host "Section ID:" $node.id
    Write-Host "Class Attribute:" $node.class
}

# Use ArrayList to store matching sections
$sections = New-Object System.Collections.ArrayList

# Manually filter sections with class "caja-anuncio"
foreach ($section in $xmlContent.root.section) {
    # Trim the class attribute to remove any trailing spaces
    $classValue = $section.Attributes["class"].Value.Trim()
    Write-Host "Trimmed Class Value:" $classValue

    # Check if the trimmed class attribute matches "caja-anuncio"
    if ($classValue -eq "caja-anuncio") {
        [void]$sections.Add($section)
        Write-Host "Added section, current count in ArrayList:" $sections.Count
    }
}

# Output the final count and the results
Write-Host "Final count in ArrayList after loop:" $sections.Count
foreach ($section in $sections) {
    Write-Host "Section ID:" $section.Attributes["id"].Value
    
    # Access nested sections
    $foto = $section.section | Where-Object { $_.Attributes["class"].Value.Trim() -eq "foto" }
    $datos = $section.section | Where-Object { $_.Attributes["class"].Value.Trim() -eq "caja-datos" }
    
    Write-Host "Foto Section:" $foto.OuterXml
    Write-Host "Datos Section:" $datos.OuterXml
    Write-Host "`n---`n"
}
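
As a possible alternative to trimming in a loop, XPath's `normalize-space()` can handle the whitespace inside the query itself. This is only a sketch under the same assumption as the script above: the fragment must be well-formed XML once it is wrapped in a root node. The names `$sampleHtml`, `$doc`, `$posts` and `$ids` are illustrative, and the inner text is placeholder content.

```powershell
# Hypothetical sample with the same shape as $htmlContent above (placeholder inner text).
$sampleHtml = @'
<section id="post-2603507" class="caja-anuncio ">
<section class="foto">foto 1</section>
<section class="caja-datos"><h5>datos 1</h5></section>
</section>
<section id="post-1149642" class="caja-anuncio ">
<section class="foto">foto 2</section>
<section class="caja-datos"><h5>datos 2</h5></section>
</section>
'@

# Same root-wrapping trick as above; the cast throws if the markup is not well-formed.
[xml]$doc = '<root>' + $sampleHtml + '</root>'

# normalize-space() strips leading and trailing whitespace from @class inside the predicate,
# so "caja-anuncio " compares equal to "caja-anuncio".
$posts = $doc.SelectNodes('/root/section[normalize-space(@class)="caja-anuncio"]')

# Collect the id of every matching top-level section.
$ids = foreach ($post in $posts) {
    $post.GetAttribute('id')
}
$ids
```

The nested blocks can be reached the same way, for example with `$post.SelectSingleNode('section[normalize-space(@class)="foto"]')`.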