I have an HTML string:
<section id="post-2603507" class="caja-anuncio ">
<section class="foto"> ... OMITTED...
</section>
<section class="caja-datos">
... OMITTED...
</h5>
</section>
</section>
... OMITTED...
<section id="post-1149642" class="caja-anuncio ">
<section class="foto"> ... OMITTED...
</section>
<section class="caja-datos">
... OMITTED...
</h5>
</section>
</section>
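How can I capture each of the multiple occurrences as a group (from its opening node to its closing node)?
Each occurrence looks like this: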
<section id="post-1149642" class="caja-anuncio ">
<section class="foto"> ... OMITTED...
</section>
<section class="caja-datos">
... OMITTED...
</h5>
</section>
</section>
I tried these regular expressions:
<section id="post-\d+" class="caja-anuncio\s*">((?s).*)
<section id="post-\d+" class="caja-anuncio\s*">((?s).*)<\/section>\s*<\/section>\s*
<section id="post-\d+" class="caja-anuncio\s*">((?s).*)<\/section>\s*<\/section>\s*<section
Any suggestions?
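You should not use regular expressions to parse HTML.
Parsing it properly is not entirely trivial either, though: the class attribute comes through with a trailing space ("caja-anuncio "), so the value has to be trimmed before it is compared.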
# HTML content, inlined here (it could also be read from a file).
# A missing <h5> opening tag was added so the markup is well-formed.
$htmlContent = @'
<section id="post-2603507" class="caja-anuncio ">
<section class="foto"> ... OMITTED... </section>
<section class="caja-datos">
... OMITTED...
<h5></h5>
</section>
</section>
<section id="post-1149642" class="caja-anuncio ">
<section class="foto"> ... OMITTED... </section>
<section class="caja-datos">
... OMITTED...
<h5></h5>
</section>
</section>
'@
# Convert HTML to XML by wrapping it in a root node
try {
    [xml]$xmlContent = [xml]("<root>" + $htmlContent + "</root>")
    Write-Host "XML Content:" $xmlContent.OuterXml
} catch {
    Write-Host "Error parsing XML:" $_.Exception.Message
}
# Debug to see we successfully parsed the HTML
Write-Host "Root Element:" $xmlContent.DocumentElement.Name
Write-Host "First-Level Children:"
$xmlContent.DocumentElement.ChildNodes | ForEach-Object { $_.OuterXml }
# Loop the root sections
foreach ($node in $xmlContent.root.section) {
    Write-Host "Section ID:" $node.id
    Write-Host "Class Attribute:" $node.class
}
# Use ArrayList to store matching sections
$sections = New-Object System.Collections.ArrayList
# Manually filter sections with class "caja-anuncio"
foreach ($section in $xmlContent.root.section) {
    # Trim the class attribute to remove any trailing spaces
    $classValue = $section.Attributes["class"].Value.Trim()
    Write-Host "Trimmed Class Value:" $classValue
    # Check if the trimmed class attribute matches "caja-anuncio"
    if ($classValue -eq "caja-anuncio") {
        [void]$sections.Add($section)
        Write-Host "Added section, current count in ArrayList:" $sections.Count
    }
}
# Output the final count and the results
Write-Host "Final count in ArrayList after loop:" $sections.Count
foreach ($section in $sections) {
    Write-Host "Section ID:" $section.Attributes["id"].Value
    # Access nested sections
    $foto = $section.section | Where-Object { $_.Attributes["class"].Value.Trim() -eq "foto" }
    $datos = $section.section | Where-Object { $_.Attributes["class"].Value.Trim() -eq "caja-datos" }
    Write-Host "Foto Section:" $foto.OuterXml
    Write-Host "Datos Section:" $datos.OuterXml
    Write-Host "`n---`n"
}
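As a possible shorter variant of the manual trimming loop, the same filter can be written as an XPath query, since XPath's normalize-space() strips the trailing space from the class value. This is only a sketch under the same assumption as the script above: the markup must be well-formed enough to load as XML. The variable names ($html, $doc, $ads, $ids) are illustrative and not part of the original script.

```powershell
# Same sample markup as above; the <root> wrapper is synthetic
$html = @'
<section id="post-2603507" class="caja-anuncio ">
<section class="foto"> ... OMITTED... </section>
<section class="caja-datos">
... OMITTED...
<h5></h5>
</section>
</section>
<section id="post-1149642" class="caja-anuncio ">
<section class="foto"> ... OMITTED... </section>
<section class="caja-datos">
... OMITTED...
<h5></h5>
</section>
</section>
'@

[xml]$doc = "<root>$html</root>"

# normalize-space() trims leading/trailing whitespace, so "caja-anuncio " matches
$ads = $doc.SelectNodes("/root/section[normalize-space(@class)='caja-anuncio']")

# Collect the id of every matching top-level section
$ids = @($ads | ForEach-Object { $_.GetAttribute('id') })

foreach ($ad in $ads) {
    # Relative XPath: look only at the direct children of the current section
    $foto  = $ad.SelectSingleNode("section[normalize-space(@class)='foto']")
    $datos = $ad.SelectSingleNode("section[normalize-space(@class)='caja-datos']")
    Write-Host "Section ID:" $ad.GetAttribute('id')
    Write-Host "Foto Section:" $foto.OuterXml
    Write-Host "Datos Section:" $datos.OuterXml
}
```

Note that the exact match on the normalized class only works while the attribute holds a single class name; an element with several classes would need a contains()-style test instead.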