Scraping inconsistent search results with PHP

Question

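How can I scrape a list of search results in which the number of items per result is inconsistent?

Here is an example:

This search returns 4 businesses: https://www.11880.com/suche/0521441422/deutschland

Not all 4 businesses include opening hours: the first one does not, and the last 3 do.

So when I run the script below, the opening hours are attached to the wrong businesses: they end up paired with the first 3 businesses instead of the last 3.

How can I change the script so that the opening hours are attached to the correct business?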

<?php

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20120101 Firefox/33.0');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, 'https://www.11880.com/suche/0521441422/deutschland');
$page = curl_exec($ch);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);

$results = [];
$results['name'] = $xpath->query('//h2[@itemprop="name"]');
$results['street'] = $xpath->query('//span[@class="street-address"]');
$results['zipcode'] = $xpath->query('//span[@class="postal-code"]');
$results['city'] = $xpath->query('//span[@class="address-locality"]');
$results['district'] = $xpath->query('//span[@class="quarter"]');
$results['opening_hours'] = $xpath->query('//span[@class="open-or-closed"]');


// XPath and CSS selector of one opening-hours element (kept as comments so the script parses):
// //*[@id="html-search-result-list"]/li[3]/div/div[3]/div[1]/span[1]
// #html-search-result-list > li:nth-child(3) > div > div.row-result-entry--bottom.row > div.col-result-entry-content--contactinfos.hidden-xs.col-sm-8 > span.btn-ghost.btn-ghost-primary.btn-result-entry-interaction.open-or-closed.open

for($x=0; $x < $results['name']->length;$x++)
{
  echo trim($results['name']->item($x)->textContent) . ";";
  echo trim($results['street']->item($x)->textContent) . ";";
  echo trim($results['zipcode']->item($x)->textContent) . ";";
  echo trim($results['city']->item($x)->textContent) . ";";
  echo trim($results['district']->item($x)->textContent) . ";";
  echo trim($results['opening_hours']->item($x)->textContent) . "<br>\n";
}

?>
php web-scraping curl xpath
1 Answer

You could do something like this. It is only a draft.

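// Find the parent div of each result name
$divs = $xpath->query('//h2[@itemprop="name"]/ancestor::div[1]');
for ($x = 0; $x < $divs->length; $x++) {
    $entry = $divs->item($x);
    // Find the items you want inside this div only (note the leading dot)
    $name = $xpath->query('.//h2[@itemprop="name"]', $entry);
    // query() returns a DOMNodeList object, which is always truthy, so check its length
    if ($name !== false && $name->length > 0) {
        echo trim($name->item(0)->textContent) . ";";
    }
    // ... repeat for street, zipcode, city, district and opening_hours
}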
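
The draft above can be extended into a fuller sketch. It is not a verified solution for the live page. The inline HTML is hypothetical sample markup, and `firstText` is a helper name made up for illustration. Using the `li` children of `#html-search-result-list` as the per-result container is an assumption based on the selector pasted in the question, and the nearest `div` ancestor of the `h2` may be too narrow to contain the opening hours. Because that selector shows the opening-hours span carrying several classes, the sketch matches `open-or-closed` as one class token rather than comparing the whole `class` attribute.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page ($page in the question).
$html = <<<HTML
<ul id="html-search-result-list">
  <li><div><h2 itemprop="name">Business A</h2>
      <span class="street-address">Street 1</span></div></li>
  <li><div><h2 itemprop="name">Business B</h2>
      <span class="street-address">Street 2</span>
      <span class="btn-ghost open-or-closed open">Open</span></div></li>
</ul>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: text of the first match inside $context, or '' if there is none.
function firstText(DOMXPath $xpath, string $query, DOMNode $context): string
{
    $nodes = $xpath->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return '';
    }
    return trim($nodes->item(0)->textContent);
}

// Matches a span whose class list contains the token "open-or-closed".
$hoursQuery = './/span[contains(concat(" ", normalize-space(@class), " "), " open-or-closed ")]';

$rows = [];
// Assumption: every search result is one <li> of the result list.
foreach ($xpath->query('//*[@id="html-search-result-list"]/li') as $entry) {
    // All lookups are relative to the current entry, so a missing field
    // stays empty instead of shifting the values of later entries.
    $rows[] = [
        'name'          => firstText($xpath, './/h2[@itemprop="name"]', $entry),
        'street'        => firstText($xpath, './/span[@class="street-address"]', $entry),
        'opening_hours' => firstText($xpath, $hoursQuery, $entry),
    ];
}

foreach ($rows as $row) {
    echo implode(';', $row) . "<br>\n";
}
```

With this sample markup, the first business gets an empty opening-hours field and the second one keeps its own value. The remaining fields from the question (zipcode, city, district) could be added to the array in the same way.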