PHP - Displaying the full content of a remote page

Question

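I need to fetch a remote page, modify some elements (using the "PHP Simple HTML DOM Parser" library), and output the modified content.

The problem is that the remote page's source does not use full URLs, so CSS and images fail to load. That doesn't stop me from modifying elements, of course, but the output looks terrible.

For example, open https://www.raspberrypi.org/downloads/

However, if you use this code: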

$html = file_get_html('http://www.raspberrypi.org/downloads');
echo $html;

the result looks terrible. I tried a simple trick, but it only helped a little:

$html = file_get_html('http://www.raspberrypi.org/downloads');
$html=str_ireplace("</head>", "<base href='http://www.raspberrypi.org'></head>", $html);
echo $html;

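Is there any way to "instruct" the script to resolve all links in the $html variable relative to "http://www.raspberrypi.org"? In other words, how can I make raspberrypi.org the "main" source for all fetched images/CSS elements?

I don't know how to explain it any better, but I'm sure you get the idea.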
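
A possible reason the `<base>` trick above helps only partly: the HTML specification asks for `<base>` to come before any element that carries a URL attribute, and the snippet places it just before `</head>`, after the stylesheet links. The sketch below inserts the tag right after the opening `<head>` instead. It works on the page as a plain string, and `add_base_tag` is a hypothetical helper name, not part of Simple HTML DOM.

```php
<?php
// Hypothetical helper: inject a <base> tag immediately after the opening <head> tag,
// so that it precedes every <link>, <script> and <img> in the document.
function add_base_tag(string $html, string $base_url): string
{
    $base_tag = '<base href="' . htmlspecialchars($base_url, ENT_QUOTES) . '">';

    // Match <head> or <head ...attributes...> (but not <header>), case-insensitively,
    // and replace only the first occurrence.
    return preg_replace_callback(
        '/<head(\s[^>]*)?>/i',
        function (array $m) use ($base_tag) {
            return $m[0] . $base_tag;
        },
        $html,
        1
    );
}
```

With the question's code this would be called as `echo add_base_tag((string) $html, 'http://www.raspberrypi.org/');`. It only changes how the browser resolves relative URLs; the URLs in the markup stay as they are.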

Answer 1

I just tried this locally, and I noticed (in the source) that the link tags in the HTML look like this:

<link rel='stylesheet' href='/wp-content/themes/mind-control/js/qtip/jquery.qtip.min.css' />

It obviously expects the file to be in my local directory (e.g. localhost/wp-content/etc.../). The link tag's href needs to look like this instead:

<link rel='stylesheet' href='https://www.raspberrypi.org/wp-content/themes/mind-control/js/qtip/jquery.qtip.min.css' />

So what you probably want to do is find all link tags and prepend "https://www.raspberrypi.org/" to their href attributes.

Edit: Hey, I actually got the styles working. Try this code:

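$html = file_get_html('http://www.raspberrypi.org/downloads');
// Prefix root-relative hrefs (e.g. /wp-content/...) with the site root;
// leave absolute and protocol-relative (//host/...) URLs untouched.
foreach ($html->find('link') as $element)
{
    if (substr($element->href, 0, 1) === '/' && substr($element->href, 0, 2) !== '//')
    {
        $element->href = 'http://www.raspberrypi.org' . $element->href;
    }
}
echo $html;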
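
The same idea can be extended beyond `<link>` tags. The following is a minimal sketch that uses PHP's bundled DOM extension (`DOMDocument`) rather than Simple HTML DOM, so it runs without the library. The helper name `prefix_root_relative_urls` and the choice of tags are assumptions made for illustration.

```php
<?php
// Sketch using PHP's bundled DOM extension instead of Simple HTML DOM.
// prefix_root_relative_urls() is a hypothetical helper name.
function prefix_root_relative_urls(string $html, string $root_url): string
{
    $doc = new DOMDocument();
    // Suppress the warnings libxml raises on real-world, non-strict HTML.
    @$doc->loadHTML($html);

    // Tag name => attribute that carries the URL.
    $targets = ['link' => 'href', 'script' => 'src', 'img' => 'src'];

    foreach ($targets as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $url = $node->getAttribute($attr);
            // Only touch root-relative URLs such as /wp-content/...;
            // skip //host/... and anything that is already absolute.
            if (substr($url, 0, 1) === '/' && substr($url, 0, 2) !== '//') {
                $node->setAttribute($attr, rtrim($root_url, '/') . $url);
            }
        }
    }

    return $doc->saveHTML();
}
```

Usage would look like `echo prefix_root_relative_urls(file_get_contents('http://www.raspberrypi.org/downloads'), 'http://www.raspberrypi.org');`. Note that `DOMDocument` re-serializes the whole page, so the output markup may differ slightly from the input.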

Answer 2

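Since only Nikolay Ganovski offered a solution, I wrote some code that turns a partial page into a complete one by finding incomplete css/img/form tags and completing their URLs. If anyone needs it, here is the code: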

// finalizes remote page by completing incomplete css/img/form URLs (path/file.css becomes http://somedomain.com/path/file.css, etc.)
function finalize_remote_page($content, $root_url)
{
    $root_url_without_scheme = preg_replace('/(?:https?:\/\/)?(?:www\.)?(.*)\/?$/i', '$1', $root_url); // ignore schemes, in case URL provided by user was http://somedomain.com while URL in source is https://somedomain.com (or vice-versa)

    $content_object = str_get_html($content);
    if (is_object($content_object))
    {
        foreach ($content_object->find('link[rel=stylesheet]') as $entry) //find css
        {
            if (substr($entry->href, 0, 2) != "//" && stristr($entry->href, $root_url_without_scheme) === FALSE) // ignore "invalid" URLs like // somedomain.com
            {
                $entry->href = $root_url.$entry->href;
            }
        }
        foreach ($content_object->find('img') as $entry) //find img
        {
            if (substr($entry->src, 0, 2) != "//" && stristr($entry->src, $root_url_without_scheme) === FALSE) // ignore "invalid" URLs like // somedomain.com
            {
                $entry->src = $root_url.$entry->src;
            }
        }
        foreach ($content_object->find('form') as $entry) //find form
        {
            if (substr($entry->action, 0, 2) != "//" && stristr($entry->action, $root_url_without_scheme) === FALSE) // ignore "invalid" URLs like // somedomain.com
            {
                $entry->action = $root_url.$entry->action;
            }
        }
    }
    return $content_object;
}
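
One caveat with plain concatenation like `$root_url.$entry->href`: a path without a leading slash ends up glued to the host name, and an absolute URL pointing at another domain gets the root prepended as well. Below is a hedged sketch of a small resolver that could replace the concatenation. `resolve_url` is a hypothetical helper, and it is deliberately simplified: it does not normalise `./` or `../` segments and does not treat `#fragment` or `?query` references specially.

```php
<?php
// Hypothetical helper: resolve $url against the page it was found on ($page_url).
// Simplified on purpose: no "./" or "../" normalisation.
function resolve_url(string $page_url, string $url): string
{
    // Already absolute (http:, https:, data:, mailto:, ...): leave untouched.
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $url)) {
        return $url;
    }

    $parts  = parse_url($page_url);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '')
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative (//host/path): only the scheme is missing.
    if (substr($url, 0, 2) === '//') {
        return $scheme . ':' . $url;
    }

    // Root-relative (/path): append to scheme + host.
    if (substr($url, 0, 1) === '/') {
        return $origin . $url;
    }

    // Path-relative (path/file.css): append to the directory of the current page.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $url;
}
```

Inside the loops above this would read, for example, `$entry->href = resolve_url($page_url, $entry->href);`, where `$page_url` is the address the page was fetched from (an extra parameter the original function does not have).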
