Collecting data from a website that uses Ajax/JavaScript - cURL


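I am trying to scrape a search form using cURL (via PHP). I thought everything was correct, or close to it, but apparently it is not. For background: I am trying to collect (scrape) data from a search form where the user enters a date range and then submits the search. The results are displayed below the search inputs. The page loads its data using AJAX/JavaScript.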

When I run the PHP script, I get nothing. I added print_r to inspect the result, but nothing is displayed.

Here is my script. All suggestions are welcome.

<?php
    class Scraper {

        // Class constructor method
        function __construct() {
            $this->useragent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3';
            $handle = fopen('cookie.txt', 'w') or exit('Unable to create or open cookie.txt file.'."\n");   // Opening or creating cookie file
            fclose($handle);    // Closing cookie file
            $this->cookie = 'cookie.txt';    // Setting a cookie file to store cookie
            $this->timeout = 30; // Setting connection timeout in seconds
        }

        // Method to search and scrape search details
        public function scrapePersons($searchString = '') {

            $searchUrl = 'https://virre.prh.fi/novus/publishedEntriesSearch';

            $postValues = array(
                'businessId' => '',
                'startDate' => '07072016',
                'endDate' => '08072016',
                'registrationTypeCode' => 'kltu.U',
                '_todayRegistered' => 'on',
                'domicileCode' => '091',
                '_domicileCode' => '1',
                '_eventId_search' => 'Search',
                'execution' => 'e2s1',
                '_defaultEventId' => '',
            );

            $search = $this->curlPostFields($searchUrl, $postValues);

            return $search;
        }

        // Method to make a POST request using form fields
        public function curlPostFields($postUrl, $postValues) {
            $_ch = curl_init(); // Initialising cURL session

            // Setting cURL options
            curl_setopt($_ch, CURLOPT_SSL_VERIFYPEER, FALSE);   // Prevent cURL from verifying SSL certificate
            curl_setopt($_ch, CURLOPT_FAILONERROR, TRUE);   // Treat HTTP status codes >= 400 as a failed request
            curl_setopt($_ch, CURLOPT_COOKIESESSION, TRUE); // Start a new cookie session (ignore stored session cookies)
            curl_setopt($_ch, CURLOPT_FOLLOWLOCATION, TRUE);    // Follow Location: headers
            curl_setopt($_ch, CURLOPT_RETURNTRANSFER, TRUE);    // Returning transfer as a string
            curl_setopt($_ch, CURLOPT_COOKIEFILE, $this->cookie);    // Setting cookiefile
            curl_setopt($_ch, CURLOPT_COOKIEJAR, $this->cookie); // Setting cookiejar
            curl_setopt($_ch, CURLOPT_USERAGENT, $this->useragent);  // Setting useragent
            curl_setopt($_ch, CURLOPT_URL, $postUrl);   // Setting URL to POST to
            curl_setopt($_ch, CURLOPT_CONNECTTIMEOUT, $this->timeout);   // Connection timeout
            curl_setopt($_ch, CURLOPT_TIMEOUT, $this->timeout); // Request timeout
            curl_setopt($_ch, CURLOPT_POST, TRUE);  // Setting method as POST
            curl_setopt($_ch, CURLOPT_POSTFIELDS, $postValues); // Setting POST fields (array)

            $results = curl_exec($_ch); // Executing cURL session
            curl_close($_ch);   // Closing cURL session

            return $results;
        }

        // Class destructor method
        function __destruct() {
            // Empty
        }
    }

    $testScrape = new Scraper();   // Instantiating new object

    $data = json_decode($testScrape->scrapePersons());   // Scraping people records
    print_r($data);
?>
javascript php ajax web-scraping curl
1 Answer

First, I would check that you are actually allowed to do this.

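Assuming you are, the problem is that the server is returning a security-check form. In a browser, this form is submitted automatically by a JavaScript onload handler; your script needs to replicate that submission before the search will work.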
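One way to confirm this is to look at the raw response rather than the decoded one. The question's script passes the response through json_decode(), which returns null for an HTML page, and print_r(null) prints nothing. The sketch below uses a hypothetical helper, describeResponse(), which is not part of the original script, to report what actually came back.

```php
<?php
// Hypothetical helper (not in the original script): summarise a raw response
// body so that a non-JSON reply is visible instead of silently becoming null.
function describeResponse($body): string
{
    if ($body === false) {
        // curl_exec() returns false on failure; curl_error() gives the reason.
        return 'cURL request failed (inspect curl_error() on the handle)';
    }
    $decoded = json_decode($body);
    if ($decoded === null && json_last_error() !== JSON_ERROR_NONE) {
        // Not JSON: report the parser error and the start of the body.
        return 'Not JSON (' . json_last_error_msg() . '): ' . substr(trim($body), 0, 60);
    }
    return 'Valid JSON';
}

// Example with an HTML body similar to what the server sends back.
echo describeResponse('<html><head><title>Security Check</title></head></html>'), "\n";
```

In the question's script, the equivalent check would be to print the return value of scrapePersons() directly before decoding it.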

The output I get is as follows.

<html>
<head>
  <title>Security Check</title></head>
<body onLoad="document.security_check_form.submit()">
<form name="security_check_form" action="j_security_check" method="POST">
<input type="hidden" value="prhanonymous" name="j_username"/>
<input type="hidden" value="*=AQICr82J28VvM2ljVarKvWv3LuibH7WPDyc8EVKuXdfytXrEv/o/MzMP3KfIEq+1Wda1ICP/pDLJQqniyBaRXTXnJGJCJhi2gVIoM0e+rwGEczxoah2+PsKOEnSI6A9j2MQO6/Q4i/vaXHVA7gfjjH7qvz0Fc+Pr7fPiBtJt+2YF3YghUN39cFhoK2O8mjRwTKORojRwcguq74B8Ttd0xyUlYld68t/mplsWv5npwMfT/wfv2XMidoVmB5k/p2rp3XbwlnHpJI3gJJcb5VV58M7frCB0vricZYv3xrKuco6qpMlX9wJeCqrhrMotY0+lisAvmEDIR3YpobE=" name="j_password"/>
</form>
</body>
</html>
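A minimal sketch of replicating that automatic submission in PHP follows. It assumes the DOM and cURL extensions are available. The function names (parseAutoSubmitForm, resolveAction, submitAutoSubmitForm) are hypothetical and not part of the original script, and the 'TOKEN' value is a placeholder for the real j_password. The site may require further steps that are not shown here.

```php
<?php
// Extract the action and the input fields of a named form.
// Returns null when the page contains no form with that name.
function parseAutoSubmitForm(string $html, string $formName): ?array
{
    if ($html === '') {
        return null;
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // Suppress warnings about imperfect HTML.
    foreach ($doc->getElementsByTagName('form') as $form) {
        if ($form->getAttribute('name') !== $formName) {
            continue;
        }
        $fields = [];
        foreach ($form->getElementsByTagName('input') as $input) {
            $fields[$input->getAttribute('name')] = $input->getAttribute('value');
        }
        return ['action' => $form->getAttribute('action'), 'fields' => $fields];
    }
    return null;
}

// Resolve a form action against the URL of the page that served the form.
// Simplified: handles absolute URLs, root-relative paths and same-directory
// names only, and ignores any port number.
function resolveAction(string $pageUrl, string $action): string
{
    if (preg_match('#^https?://#i', $action)) {
        return $action;
    }
    $parts = parse_url($pageUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (isset($action[0]) && $action[0] === '/') {
        return $origin . $action;
    }
    $path = $parts['path'] ?? '/';
    return $origin . substr($path, 0, strrpos($path, '/') + 1) . $action;
}

// POST the hidden fields back, reusing the same cookie file so that any
// session cookie set by the security check is kept for later requests.
function submitAutoSubmitForm(string $url, array $fields, string $cookieFile)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST => true,
        // A string body is sent as application/x-www-form-urlencoded,
        // which is what a plain HTML form without an enctype sends.
        CURLOPT_POSTFIELDS => http_build_query($fields),
        CURLOPT_COOKIEFILE => $cookieFile,
        CURLOPT_COOKIEJAR => $cookieFile,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
    ]);
    return curl_exec($ch); // String on success, false on failure.
}

// Offline demonstration with a shortened copy of the form ('TOKEN' is a placeholder).
$sample = '<html><body onLoad="document.security_check_form.submit()">'
    . '<form name="security_check_form" action="j_security_check" method="POST">'
    . '<input type="hidden" value="prhanonymous" name="j_username"/>'
    . '<input type="hidden" value="TOKEN" name="j_password"/>'
    . '</form></body></html>';
$form = parseAutoSubmitForm($sample, 'security_check_form');
echo resolveAction('https://virre.prh.fi/novus/publishedEntriesSearch', $form['action']), "\n";
```

In the question's script, the idea would be to run parseAutoSubmitForm() on the response from curlPostFields(), call submitAutoSubmitForm() with the resolved URL and cookie.txt, and then repeat the search request. Because redirects are followed, the page URL passed to resolveAction() should be the final one, which cURL reports through curl_getinfo() with CURLINFO_EFFECTIVE_URL.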