Remotely scraping a page with XPath and getting the most relevant title or description for an image

Question

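What I'm doing is essentially what a Tweet button or a Facebook Share/Like button does: scrape a page and find the most relevant title for one piece of data. The best example I can think of is clicking a Facebook Like button on the front page of a site that lists many articles. It picks up the correct information for the post associated with the (nearest) Like button. Some sites have Open Graph tags and some don't, yet it still works.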

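Since this is done remotely, the only thing I control is which data I want to target. In this case the data is images. Rather than just retrieving the page's `<title>`, I'd like to somehow traverse the DOM backwards from each image and find the nearest "title". The problem is that not every title appears before its image, although the chance that the image comes after the title seems fairly high. That said, I'd like this to work on almost any site.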

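Ideas:

Find the image's "container" and use the first block of text in it.
Look for a block of text in elements that carry certain classes ("description", "title") or are certain elements (h1, h2, h3, h4).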
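
The two ideas above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: `findNearbyCaption`, the `$maxLevels` limit, and the sample markup are all made up for illustration. It walks up from each image one container at a time and returns the first non-empty heading or "title"/"description"-classed element it finds inside that container.

```php
<?php
// Hypothetical helper: walk up from an <img> and return the text of the first
// h1-h4 or "title"/"description"-classed element found in an enclosing container.
function findNearbyCaption(DOMXPath $xpath, DOMElement $img, int $maxLevels = 4): ?string
{
    // Relative query: only descendants of the current container are considered.
    $query = './/*[self::h1 or self::h2 or self::h3 or self::h4'
           . ' or contains(@class, "title") or contains(@class, "description")]';

    $container = $img->parentNode;
    for ($level = 0; $level < $maxLevels && $container instanceof DOMElement; $level++) {
        foreach ($xpath->query($query, $container) as $candidate) {
            // Collapse whitespace so multi-line headings come back as one line.
            $text = trim(preg_replace('/\s+/', ' ', $candidate->textContent));
            if ($text !== '') {
                return $text;
            }
        }
        $container = $container->parentNode; // widen the search by one level
    }

    return null; // nothing found within $maxLevels containers
}

// Sample markup (assumed for the example): one caption before, one after the image.
$html = <<<HTML
<html><body>
  <div class="post"><h2>Sunset over the bay</h2><p><img src="a.jpg"></p></div>
  <div class="post"><img src="b.jpg"><span class="description">A red bicycle</span></div>
</body></html>
HTML;

libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$captions = [];
foreach ($xpath->query('//img[@src]') as $img) {
    $captions[$img->getAttribute('src')] = findNearbyCaption($xpath, $img);
}

print_r($captions);
```

Because the search covers everything inside the container, it handles captions that come before or after the image. The trade-off is that a large `$maxLevels` can pick up a heading that belongs to a neighbouring post, so the limit probably needs tuning per site.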

Title fallbacks:

Use the Open Graph tags
Use just the `<title>`
Use just the ALT attribute
Use the META tags
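
A fallback chain like the one listed above might look like this in PHP. It is only a sketch under assumptions: `fallbackTitle` and `xpathFor` are hypothetical helper names, and "META tags" is interpreted here as `<meta name="description">`, which the question does not specify.

```php
<?php
// Hypothetical helper: parse an HTML string and return a DOMXPath for it.
function xpathFor(string $html): DOMXPath
{
    libxml_use_internal_errors(true); // tolerate invalid real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    return new DOMXPath($dom);
}

// Hypothetical helper: try each fallback in the order listed and return the
// first one that yields non-empty text, or null if none does.
function fallbackTitle(DOMXPath $xpath, ?DOMElement $img = null): ?string
{
    $candidates = [
        $xpath->evaluate('string(//meta[@property="og:title"]/@content)'), // Open Graph
        $xpath->evaluate('string(//title)'),                               // <title>
        $img !== null ? $img->getAttribute('alt') : '',                    // ALT attribute
        $xpath->evaluate('string(//meta[@name="description"]/@content)'),  // assumed META tag
    ];

    foreach ($candidates as $candidate) {
        $candidate = trim($candidate);
        if ($candidate !== '') {
            return $candidate;
        }
    }

    return null;
}

// Three made-up pages: with Open Graph, with only <title>, with only an alt text.
$withOg    = xpathFor('<html><head><meta property="og:title" content="OG headline">'
                    . '<title>Plain page title</title></head><body></body></html>');
$titleOnly = xpathFor('<html><head><title>Plain page title</title></head><body></body></html>');
$altOnly   = xpathFor('<html><body><img src="a.jpg" alt="A lighthouse"></body></html>');
$altImg    = $altOnly->query('//img')->item(0);

$fromOg    = fallbackTitle($withOg);
$fromTitle = fallbackTitle($titleOnly);
$fromAlt   = fallbackTitle($altOnly, $altImg);

var_dump($fromOg, $fromTitle, $fromAlt);
```

The order of the `$candidates` array is the whole policy, so reordering the fallbacks is a one-line change.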

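Summary: extracting the images is not the problem; getting the relevant title for each of them is.

Question: how would you go about getting the relevant title for each image? Perhaps with DomDocument or XPath?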

Answer 1

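Your approach seems good enough. I would just give certain tags/attributes a weight and loop through them with XPath queries until I find one that exists and isn't empty. Something like: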

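i = 1                                 # XPath positions start at 1, not 0

while ((//img)[i][@src])              # (//img)[i] is the i-th image in the document
  if ((//img)[i][@alt])
    return alt
  else if ((//img)[i][@description])
    return description
  else if ((//img)[i]/../p[1])        # first <p> inside the image's parent
    return p
  else
    return (//title)

  i++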
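
For reference, the loop above could be turned into runnable PHP along these lines. This is a sketch, not the answer's actual code: `describeImages` is a hypothetical name, the sample page is made up, and the standard `title` attribute stands in for the `description` attribute of the pseudocode, since `<img>` has no such attribute in HTML.

```php
<?php
// Hypothetical helper: map each image's src to the first non-empty candidate text.
function describeImages(string $html): array
{
    libxml_use_internal_errors(true); // tolerate invalid real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // Candidates in priority order ("weight"); relative paths start at the <img>.
    $queries = [
        'string(@alt)',     // the image's own alt text
        'string(@title)',   // assumption: stands in for "description"
        'string(../p[1])',  // first <p> inside the image's parent
        'string(//title)',  // last resort: the page title
    ];

    $labels = [];
    foreach ($xpath->query('//img[@src]') as $img) {
        $label = null;
        foreach ($queries as $query) {
            $text = trim($xpath->evaluate($query, $img));
            if ($text !== '') {
                $label = $text;
                break; // stop at the highest-priority hit
            }
        }
        $labels[$img->getAttribute('src')] = $label;
    }

    return $labels;
}

// Made-up page covering three of the cases.
$html = <<<HTML
<html><head><title>Gallery</title></head><body>
  <div><img src="a.png" alt="A lighthouse"></div>
  <div><img src="b.png"><p>Boats in the harbour</p></div>
  <div><img src="c.png"></div>
</body></html>
HTML;

$labels = describeImages($html);
print_r($labels);
```

Unlike the pseudocode, this version keeps going after the first image and collects one label per `src`, which is closer to what the question asks for.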

A simple XPath example (the function is ported from my framework):

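function ph_DOM($html, $xpath = null)
{
    // Already parsed (a SimpleXMLElement): run the query, or return the node as-is.
    if (is_object($html) === true)
    {
        if (isset($xpath) === true)
        {
            $html = $html->xpath($xpath);
        }

        return $html;
    }

    else if (is_string($html) === true)
    {
        $dom = new DOMDocument();

        // Collect libxml parse warnings internally; clear any left from an earlier call.
        if (libxml_use_internal_errors(true) === true)
        {
            libxml_clear_errors();
        }

        // The original called ph()->Text->Unicode->mb_html_entities($html), a helper from
        // the author's framework. Assumption: it encodes non-ASCII characters as numeric
        // entities, which the built-in call below also does.
        $html = mb_encode_numericentity($html, array(0x80, 0x10FFFF, 0, 0x1FFFFF), 'UTF-8');

        if ($dom->loadHTML($html) === true)
        {
            // Convert to SimpleXML and recurse into the object branch above.
            return ph_DOM(simplexml_import_dom($dom), $xpath);
        }
    }

    return false;
}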

Example usage:

$html = file_get_contents('http://en.wikipedia.org/wiki/Photography');

print_r(ph_DOM($html, '//img')); // gets all images
print_r(ph_DOM($html, '//img[@src]')); // gets all images that have a src
print_r(ph_DOM($html, '//img[@src]/..')); // gets the parent element of every image that has a src
print_r(ph_DOM($html, '//img[@src]/../..')); // gets the grandparent element, and so on...
print_r(ph_DOM($html, '//title')); // get the title of the page
