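I am trying to scrape data from a website with PHP Simple HTML DOM Parser. However, every time I try to fetch the page's HTML, I get a 403 Forbidden error.
To troubleshoot, I tried using Guzzle to mimic a browser request with custom headers (including the User-Agent). The problem persists, and I still cannot retrieve the page content.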
// using simple dom parser
require '../simple_html_dom.php';
$html = file_get_html('https://www.mywebsite.com');
$title = $html->find('title', 0);
$image = $html->find('img', 0);
echo $title->plaintext."<br>\n";
echo $image->src;
// using guzzle
require '../../vendor/autoload.php';
use GuzzleHttp\Client;
$url = "https://www.mywebsite.com";
$client = new Client();
try {
    $response = $client->request('GET', $url, [
        'headers' => [
            'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept-Language' => 'en-US,en;q=0.9',
            'Accept-Encoding' => 'gzip, deflate, br',
            'accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Referer' => 'https://www.mywebsite.com',
        ]
    ]);
    if ($response->getStatusCode() === 200) {
        $html = $response->getBody()->getContents();
        echo "Fetched HTML (first 500 characters):\n" . substr($html, 0, 500) . "\n\n";
        // Continue with DOM parsing...
    } else {
        echo "Failed to fetch the URL. HTTP Status Code: " . $response->getStatusCode() . "\n";
    }
} catch (Exception $e) {
    echo "An error occurred: " . $e->getMessage() . "\n";
}
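I suspect the server uses additional mechanisms, such as IP blocking, anti-bot protection, or cookie checks, that cause the 403 error.
Any guidance on resolving this would be greatly appreciated!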
A 403 Forbidden error can be caused by IP blacklisting, User-Agent filtering, or rate limiting. Here is how to address each:
Add headers to mimic a browser request:
'headers' => [
    ....
    'Connection' => 'keep-alive',
    'Cache-Control' => 'no-cache',
    'Pragma' => 'no-cache',
    'Upgrade-Insecure-Requests' => '1',
]
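The same idea can be applied to the Simple HTML DOM attempt, which otherwise sends PHP's default request headers. The following is a minimal sketch, not a known-good header set for any particular site: `browserLikeContext()` is a hypothetical helper name, and the header values are illustrative.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: builds a stream context that sends browser-like headers.
 * The default header values are illustrative assumptions.
 *
 * @param array<string, string> $extraHeaders Additional or overriding headers.
 * @return resource A stream context usable with file_get_contents().
 */
function browserLikeContext(string $userAgent, array $extraHeaders = [])
{
    $headers = array_merge([
        'User-Agent' => $userAgent,
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language' => 'en-US,en;q=0.9',
    ], $extraHeaders);

    // The http wrapper expects "Name: value" lines separated by CRLF.
    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }

    // The 'http' option key also applies to https:// URLs.
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => implode("\r\n", $lines),
            'ignore_errors' => true, // return the body even on 4xx so it can be inspected
            'timeout' => 15,
        ],
    ]);
}
```

The context can be passed as the third argument to `file_get_contents($url, false, $context)`. Assuming your copy of simple_html_dom.php follows the usual signature, `file_get_html($url, false, $context)` should accept it in the same position. Headers alone will not help if the block is based on IP address or a JavaScript challenge.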
Pause between requests to avoid detection (sleep(2);).
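A fixed `sleep(2)` is the simplest form of pausing. As a variation that goes beyond the suggestion above, the wait can grow after each failed attempt. This is a rough sketch under assumptions: `backoffDelay()` and `fetchWithBackoff()` are hypothetical helpers, the 2-second base and 30-second cap are arbitrary, and `$fetch` stands in for whatever performs the real HTTP call.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: seconds to wait after failed attempt number $attempt (0-based).
 * Starts at $base seconds and doubles each time, capped at $cap.
 */
function backoffDelay(int $attempt, float $base = 2.0, float $cap = 30.0): float
{
    return min($cap, $base * (2 ** $attempt));
}

/**
 * Calls $fetch($attempt) until it returns a non-null value or $maxAttempts is
 * reached, sleeping between attempts. Returns null if every attempt fails.
 */
function fetchWithBackoff(callable $fetch, int $maxAttempts = 4, float $base = 2.0): mixed
{
    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        $result = $fetch($attempt);
        if ($result !== null) {
            return $result;
        }
        // No need to wait after the final attempt.
        if ($attempt < $maxAttempts - 1) {
            usleep((int) round(backoffDelay($attempt, $base) * 1_000_000));
        }
    }
    return null;
}
```

In practice `$fetch` would wrap the Guzzle request and return `null` when the server answers with 403 or 429. Slowing down only helps when the block is rate-based.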
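Check your IP:
Test access from a different network (for example, mobile data). If that works, your IP is probably blocked. If necessary, route requests through a proxy: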
$response = $client->request('GET', $url, [
    'proxy' => 'http://proxy-server:port',
    'headers' => ['User-Agent' => '...']
]);
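PHP 8.4 introduced enhancements that simplify web scraping, in particular the querySelector() and querySelectorAll() methods in the DOM extension.
These methods select DOM elements with CSS selectors, which simplifies parsing HTML documents and extracting data from them. Note that they are available on the new Dom\HTMLDocument class and its elements, not on the legacy DOMDocument class.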
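To try the selectors without touching the network, here is a small sketch that parses an inline HTML string. The sample markup and variable names are made up for illustration, and it requires PHP 8.4 or later.

```php
<?php
declare(strict_types=1);

// Inline sample markup (an assumption for this sketch) so no HTTP request is needed.
$sampleHtml = <<<'HTML'
<!DOCTYPE html>
<html>
<head><title>Demo page</title></head>
<body>
  <img src="/images/first.png" alt="first">
  <img src="/images/second.png" alt="second">
</body>
</html>
HTML;

// Parse with the HTML5 parser; LIBXML_NOERROR silences warnings about malformed markup.
$doc = Dom\HTMLDocument::createFromString($sampleHtml, LIBXML_NOERROR);

// querySelector() returns the first match or null.
$titleNode = $doc->querySelector('title');
$pageTitle = $titleNode?->textContent ?? 'Not found';

// querySelectorAll() returns an iterable list of every match.
$imageSources = [];
foreach ($doc->querySelectorAll('img[src]') as $img) {
    $imageSources[] = $img->getAttribute('src');
}

echo "Page Title: {$pageTitle}\n";
echo "Images: " . implode(', ', $imageSources) . "\n";
```

Replacing `$sampleHtml` with the body returned by the HTTP client gives the same result on a real page, once the 403 itself is resolved.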
Your code, updated for PHP 8.4:
<?php
require '../../vendor/autoload.php';
use GuzzleHttp\Client;
// Define the URL
$url = "https://www.mywebsite.com";
// Create a Guzzle HTTP client
$client = new Client();
try {
    // Send a GET request with appropriate headers
    $response = $client->request('GET', $url, [
        'proxy' => 'http://proxy-server:port',
        'headers' => [
            ......
        ]
    ]);
    // Check the HTTP status code
    if ($response->getStatusCode() === 200) {
        // Fetch the HTML content
        $htmlContent = $response->getBody()->getContents();
        // Parse with the PHP 8.4 HTML5 parser. querySelector() exists on
        // Dom\HTMLDocument, not on the legacy DOMDocument class.
        // LIBXML_NOERROR suppresses warnings for invalid HTML.
        $doc = Dom\HTMLDocument::createFromString($htmlContent, LIBXML_NOERROR);
        // Use querySelector to extract the page title
        $title = $doc->querySelector('title');
        echo "Page Title: " . ($title ? $title->textContent : "Not found") . "<br>\n";
        // Use querySelector to extract the first image
        $image = $doc->querySelector('img');
        echo "Image Source: " . ($image ? $image->getAttribute('src') : "Not found") . "<br>\n";
    } else {
        echo "Failed to fetch the URL. HTTP Status Code: " . $response->getStatusCode() . "\n";
    }
} catch (Exception $e) {
    // With Guzzle's default 'http_errors' => true, a 4xx/5xx response
    // (including 403) throws and is reported here rather than in the else branch.
    echo "An error occurred: " . $e->getMessage() . "\n";
}