I used DOMDocument and DOMXPath to arrive at a solution:
$test = <<< HTML
<p class="Heading1-P">
<span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
<span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
<span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 3</span>
</p>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($test);
$xpath = new DOMXPath($dom);
$heading = parseToArray($xpath,'Heading1-H');
$content = parseToArray($xpath,'Normal-H');
var_dump($heading);
echo "<br/>";
var_dump($content);
echo "<br/>";
function parseToArray(DOMXPath $xpath, string $class): array
{
    $xpathquery = "//*[@class='$class']";
    $elements = $xpath->query($xpathquery);
    $resultarray = [];
    foreach ($elements as $element) {
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            $resultarray[] = $node->nodeValue;
        }
    }
    return $resultarray;
}
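A note on the inner loop above: it pushes one array entry per child node, so a span with nested markup (say, a <b> inside it) would produce several entries. If one string per matched element is what you want, reading textContent may be simpler. A minimal sketch, assuming that is the desired behaviour; parseTextByClass and the sample markup are made up for illustration and are not part of the answer above:

```php
<?php
// Hypothetical helper (not from the original answer): returns one string per
// element whose class attribute equals $class exactly.
function parseTextByClass(DOMXPath $xpath, string $class): array
{
    $result = [];
    foreach ($xpath->query("//*[@class='$class']") as $element) {
        // textContent concatenates the text of all descendant nodes
        $result[] = trim($element->textContent);
    }
    return $result;
}

// Sample markup (an assumption): the first span contains a nested <b> element
$sample = '<p class="Heading1-P"><span class="Heading1-H">Chapter <b>1</b></span></p>'
        . '<p class="Normal-P"><span class="Normal-H">This is chapter 1</span></p>';

$dom = new DOMDocument();
$dom->loadHTML($sample);
$xpath = new DOMXPath($dom);

$headings = parseTextByClass($xpath, 'Heading1-H');
$contents = parseTextByClass($xpath, 'Normal-H');

var_dump($headings);
var_dump($contents);
```

With the childNodes loop, the first span here would yield two entries ("Chapter " and "1"); with textContent it yields a single "Chapter 1".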
Try having a look at PHP Simple HTML DOM Parser.
It has a nice jQuery-like syntax, so you can easily select any element you want by ID or class.
// include/require the simple html dom parser file
$html_string = '
<p class="Heading1-P">
<span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
<span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
<span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 3</span>
</p>';
$html = str_get_html($html_string);
// Initialise the result arrays so they exist even when nothing matches
$heading = [];
$content = [];
foreach ($html->find('span') as $element) {
    if ($element->class === 'Heading1-H') {
        $heading[] = $element->innertext;
    } elseif ($element->class === 'Normal-H') {
        $content[] = $element->innertext;
    }
}
DiDOM is another way to parse HTML. Install it with Composer:
composer require imangazaliev/didom
<?php
use DiDom\Document;
require_once('vendor/autoload.php');
$html = <<<HTML
<p class="Heading1-P">
<span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
<span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
<span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 3</span>
</p>
HTML;
$document = new Document($html);
// find chapter headings
$elements = $document->find('.Heading1-H');
$headings = [];
foreach ($elements as $element) {
    $headings[] = $element->text();
}
// find chapter texts
$elements = $document->find('.Normal-H');
$chapters = [];
foreach ($elements as $element) {
    $chapters[] = $element->text();
}
echo("Headings\n");
foreach ($headings as $heading) {
    echo("- {$heading}\n");
}
echo("Chapter texts\n");
foreach ($chapters as $chapter) {
    echo("- {$chapter}\n");
}
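One option is to use DOMDocument and DOMXPath. They do have a bit of a learning curve, but once you are past it you will be quite happy with what you can do with them.
Read the following pages on php.net: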
http://php.net/manual/en/class.domdocument.php
http://php.net/manual/en/class.domxpath.php
Hope this helps.
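This is a functional-style equivalent of @saji89's answer. Search at any level for any element with the desired class (use contains() if an element might have more than one class assigned), then use text() to target the node text. After converting the DOMNodeList returned by the query to an array, just isolate the nodeValue column.
Code: (Demo)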
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach (['Heading1-H', 'Normal-H'] as $class) {
    var_export(
        array_column(
            iterator_to_array($xpath->query("//*[@class='$class']/text()")),
            'nodeValue'
        )
    );
    echo "\n---\n";
}
Output:
array (
0 => 'Chapter 1',
1 => 'Chapter 2',
2 => 'Chapter 3',
)
---
array (
0 => 'This is chapter 1',
1 => 'This is chapter 2',
2 => 'This is chapter 3',
)
---
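For the multiple-class case mentioned above, here is a rough sketch of what a contains() query might look like. A bare contains(@class, 'Heading1-H') would also match longer names such as Heading1-Hx, so this version pads the attribute with spaces and looks for the class as a whole token. The sample markup and the "extra" class are assumptions made up for the example:

```php
<?php
// Sample markup (an assumption): the first span carries a second class
$html = '<p><span class="Heading1-H extra">Chapter 1</span></p>'
      . '<p><span class="Normal-H">This is chapter 1</span></p>'
      . '<p><span class="Heading1-H">Chapter 2</span></p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$class = 'Heading1-H';
// Wrap the normalised class list in spaces, then search for " $class "
// so that only whole class names match
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]/text()";

$texts = [];
foreach ($xpath->query($query) as $textNode) {
    $texts[] = $textNode->nodeValue;
}

var_export($texts);
```

Both Heading1-H spans are matched here, including the one with the extra class, which the exact @class='$class' comparison would skip.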
// Create a DOM from a URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
    echo $element->href . '<br>';