How do I resolve a 403 Forbidden error when scraping a website with PHP Simple HTML DOM Parser?

Question

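I am trying to scrape data from a website using PHP Simple HTML DOM Parser. However, every time I try to fetch the page's HTML content, I get a 403 Forbidden error.

To troubleshoot, I tried using Guzzle to simulate a browser request with custom headers (including a User-Agent). Even so, the problem persists and I cannot retrieve the page content.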

// using simple dom parser
require '../simple_html_dom.php';

$html = file_get_html('https://www.mywebsite.com');
$title = $html->find('title', 0);
$image = $html->find('img', 0);

echo $title->plaintext."<br>\n";
echo $image->src;

// using guzzle
require '../../vendor/autoload.php';

use GuzzleHttp\Client;

$url = "https://www.mywebsite.com";
$client = new Client();

try {
    $response = $client->request('GET', $url, [
        'headers' => [
            'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept-Language' => 'en-US,en;q=0.9',
            'Accept-Encoding' => 'gzip, deflate, br',
            'accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Referer' => 'https://www.mywebsite.com',
        ]
    ]);

    if ($response->getStatusCode() === 200) {
        $html = $response->getBody()->getContents();
        echo "Fetched HTML (first 500 characters):\n" . substr($html, 0, 500) . "\n\n";

        // Continue with DOM parsing...
    } else {
        echo "Failed to fetch the URL. HTTP Status Code: " . $response->getStatusCode() . "\n";
    }
} catch (Exception $e) {
    echo "An error occurred: " . $e->getMessage() . "\n";
}

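I suspect the server has additional mechanisms in place, such as IP blocking, anti-bot protection, or cookie checks, that cause the 403 error.

Are there other headers or configuration options I should include to get past the 403 Forbidden error?
Are there alternative approaches or libraries that work better for scraping websites with these kinds of restrictions?

Any guidance on resolving this would be greatly appreciated!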

Answer 1

A 403 Forbidden error can be caused by IP blacklisting, User-Agent filtering, or rate limiting. Here are some things to try:

Add headers to mimic a browser request:

'headers' => [
    ....
    'Connection' => 'keep-alive',
    'Cache-Control' => 'no-cache',
    'Pragma' => 'no-cache',
    'Upgrade-Insecure-Requests' => '1',
]
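If you would rather stay with Simple HTML DOM than switch to Guzzle, the same kind of headers can be sent through a PHP stream context. The sketch below is only an illustration: `buildBrowserContext()` is a hypothetical helper, the header values are examples rather than a guaranteed fix, and the commented-out call assumes that `file_get_html()` accepts a stream context as its third argument.

```php
<?php

/**
 * Build a stream context that sends browser-like headers.
 * Hypothetical helper; the values passed in are illustrative only.
 */
function buildBrowserContext(string $userAgent, array $extraHeaders = [])
{
    // Turn ['Name' => 'value'] pairs into "Name: value" header lines
    $lines = [];
    foreach ($extraHeaders as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }

    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'header'     => implode("\r\n", $lines),
            'timeout'    => 15, // seconds
        ],
    ]);
}

$context = buildBrowserContext(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    [
        'Accept-Language'           => 'en-US,en;q=0.9',
        'Upgrade-Insecure-Requests' => '1',
    ]
);

// Assumed usage with Simple HTML DOM (third argument taken to be the context):
// $html = file_get_html('https://www.mywebsite.com', false, $context);
```

Without a context, PHP's HTTP stream wrapper sends no User-Agent header unless one is configured in php.ini, which some servers reject outright.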

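Pause between requests to avoid detection:
```
sleep(2);
```

Check your IP:
Test access from a different network (for example, mobile data). If that works, your IP address is probably blocked. If needed, route requests through a proxy: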
```
$response = $client->request('GET', $url, [
     'proxy' => 'http://proxy-server:port',
     'headers' => ['User-Agent' => '...']
]);
```
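The pause-between-requests idea can be taken one step further by waiting longer after each blocked attempt. The following is a minimal sketch, not part of the original answer: `fetchWithBackoff()` is a hypothetical helper, and `$fetch` is assumed to be any callable that performs one request and returns the HTTP status code.

```php
<?php

/**
 * Call $fetch until it returns a status other than 403 or 429, waiting
 * longer after each blocked attempt. $fetch is a hypothetical callable
 * that performs one request and returns the HTTP status code.
 */
function fetchWithBackoff(callable $fetch, int $maxAttempts = 4, int $baseDelayMs = 2000): int
{
    $status = 0;
    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        $status = $fetch();
        if ($status !== 403 && $status !== 429) {
            return $status; // not blocked or throttled, stop retrying
        }
        if ($attempt < $maxAttempts - 1) {
            // Exponential backoff plus a little random jitter (roughly 2s, 4s, 8s by default)
            $delayMs = $baseDelayMs * (2 ** $attempt) + random_int(0, 250);
            usleep($delayMs * 1000);
        }
    }
    return $status; // last status seen after exhausting all attempts
}

// Possible wiring with the Guzzle client from the question. 'http_errors' => false
// makes Guzzle return a 403 as a status code instead of throwing an exception:
// $status = fetchWithBackoff(function () use ($client, $url) {
//     return $client->request('GET', $url, ['http_errors' => false])->getStatusCode();
// });
```

If the block is based on your IP address or on a JavaScript challenge, retrying will not help, so keep the number of attempts small.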

Suggestion:

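PHP 8.4 introduced enhancements that simplify web scraping tasks, in particular the querySelector() and querySelectorAll() methods in the DOM extension. They are available on the new Dom\HTMLDocument and Dom\Element classes, not on the legacy DOMDocument class.

These methods select DOM elements using CSS selectors, which simplifies parsing HTML documents and extracting data from them.

Your code, adapted for PHP 8.4: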

<?php

require '../../vendor/autoload.php';

use GuzzleHttp\Client;

// Define the URL
$url = "https://www.mywebsite.com";

// Create a Guzzle HTTP client
$client = new Client();

try {
    // Send a GET request with appropriate headers
    $response = $client->request('GET', $url, [
        'proxy' => 'http://proxy-server:port',
        'headers' => [
            ......
        ]
    ]);

    // Check the HTTP status code
    if ($response->getStatusCode() === 200) {
        // Fetch the HTML content
        $htmlContent = $response->getBody()->getContents();

        // Parse the HTML with the PHP 8.4 DOM API (the legacy DOMDocument class has no querySelector())
        // LIBXML_NOERROR suppresses warnings for invalid HTML
        $doc = Dom\HTMLDocument::createFromString($htmlContent, LIBXML_NOERROR);

        // Use querySelector to extract the page title
        $title = $doc->querySelector('title');
        echo "Page Title: " . ($title ? $title->textContent : "Not found") . "<br>\n";

        // Use querySelector to extract the first image
        $image = $doc->querySelector('img');
        echo "Image Source: " . ($image ? $image->getAttribute('src') : "Not found") . "<br>\n";
    } else {
        echo "Failed to fetch the URL. HTTP Status Code: " . $response->getStatusCode() . "\n";
    }
} catch (Exception $e) {
    echo "An error occurred: " . $e->getMessage() . "\n";
}
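To try the CSS selector methods without making any network request, you can run them against an inline HTML string. This is a sketch under stated assumptions: `extractLinks()` and the sample markup are made up for illustration, and the function falls back to XPath on PHP versions older than 8.4, where `Dom\HTMLDocument` does not exist.

```php
<?php

/**
 * Return the href of every link in an HTML string.
 * Uses CSS selectors on PHP 8.4+, and falls back to XPath on older versions.
 * Hypothetical helper written for this example.
 */
function extractLinks(string $html): array
{
    $links = [];

    if (class_exists('Dom\HTMLDocument')) {
        // PHP 8.4+: HTML5 parser with querySelectorAll() support
        $doc = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);
        foreach ($doc->querySelectorAll('a[href]') as $a) {
            $links[] = $a->getAttribute('href');
        }
        return $links;
    }

    // Older PHP: the legacy DOMDocument has no querySelectorAll(), so use XPath
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Made-up sample markup: two links with an href and one anchor without
$sample = '<!DOCTYPE html><html><head><title>Demo</title></head><body>'
    . '<a href="/first">First</a><a name="x">No href</a><a href="/second">Second</a>'
    . '</body></html>';

print_r(extractLinks($sample));
```

The `a[href]` selector and the `//a[@href]` XPath expression both skip anchors that have no href attribute, so the two branches should return the same list.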
