How do I scrape the following with Web::Scraper?


This question is different from, but related to, How to Parse this HTML with Web::Scraper?

I have to scrape pages with Web::Scraper where the HTML can vary slightly. Sometimes it looks like this:

<div>
  <p>
    <strong>TITLE1</strong>
    <br>
    DESCRIPTION1
  </p>
  <p>
    <strong>TITLE2</strong>
    <br>
    DESCRIPTION2
  </p>
  <p>
    <strong>TITLE3</strong>
    <br>
    DESCRIPTION3
  </p>
</div>

I use Web::Scraper with the following code to extract it:

my $test = scraper {
    process 'div p', 'test[]' => scraper {
        process 'p strong', 'name' => 'TEXT';
        process '//p/text()', 'desc' => [ 'TEXT', sub { s/^\s+|\s+$//g } ];
    };
};

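But sometimes the page contains the following HTML instead (note that the title/description pairs are no longer separated by <p> elements):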

<div>
  <p>
    <strong>TITLE1</strong>
    <br>
    DESCRIPTION1
    <strong>TITLE2</strong>
    <br>
    DESCRIPTION2
    <strong>TITLE3</strong>
    <br>
    DESCRIPTION3
  </p>
</div>

How can I scrape the HTML above into the following structure?

test => [
  { desc => "DESCRIPTION1 ", name => "TITLE1" },
  { desc => "DESCRIPTION2 ", name => "TITLE2" },
  { desc => "DESCRIPTION3 ", name => "TITLE3" },
]

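I tried modifying the code above, but I can't figure out which part of the HTML to use to "split" it into the individual title/description pairs.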

Tags: html perl web-scraping dom
2 answers
1 vote

I've never used Web::Scraper, but it seems to be either broken or behaving strangely.

The following XPath expressions should more or less work for both cases (with minor adjustments):

//div//strong/text()
//div//br/following-sibling::text()

Plugging them into xmllint (libxml2):

tmp >xmllint --html --shell a.html
/ > cat /
 -------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div>
  <p>
    <strong>TITLE1</strong>
    <br>
    DESCRIPTION1
  </p>
  <p>
    <strong>TITLE2</strong>
    <br>
    DESCRIPTION2
  </p>
  <p>
    <strong>TITLE3</strong>
    <br>
    DESCRIPTION3
  </p>
</div>
</body></html>

/ > xpath //div//strong/text()
Object is a Node Set :
Set contains 3 nodes:
1  TEXT
    content=TITLE1
2  TEXT
    content=TITLE2
3  TEXT
    content=TITLE3
/ > xpath //div//br/following-sibling::text()
Object is a Node Set :
Set contains 3 nodes:
1  TEXT
    content=     DESCRIPTION1
2  TEXT
    content=     DESCRIPTION2
3  TEXT
    content=     DESCRIPTION3

/ > load b.html
/ > cat /
 -------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
    <p>
    <strong>TITLE1</strong>
    <br>
    DESCRIPTION1
    <strong>TITLE2</strong>
    <br>
    DESCRIPTION2
    <strong>TITLE3</strong>
    <br>
    DESCRIPTION3
    </p>
</div></body></html>

/ > xpath //div//strong/text()
Object is a Node Set :
Set contains 3 nodes:
1  TEXT
    content=TITLE1
2  TEXT
    content=TITLE2
3  TEXT
    content=TITLE3
/ > xpath //div//br/following-sibling::text()
Object is a Node Set :
Set contains 5 nodes:
1  TEXT
    content=  DESCRIPTION1
2  TEXT
    content=
3  TEXT
    content=  DESCRIPTION2
4  TEXT
    content=
5  TEXT
    content=  DESCRIPTION3
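The five-node result for the second file appears because `following-sibling::text()` collects every later text sibling of each `<br>`, including the whitespace-only nodes that sit between a `<strong>` and the next `<br>`. One possible "minor adjustment" is a `[1]` predicate, which keeps only the first text node after each `<br>`. The sketch below is not part of the original answer; it assumes the XML::LibXML module is installed, and the `trimmed` helper and variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# The single-<p> variant from the question.
my $html = <<'HTML';
<div>
  <p>
    <strong>TITLE1</strong>
    <br>
    DESCRIPTION1
    <strong>TITLE2</strong>
    <br>
    DESCRIPTION2
    <strong>TITLE3</strong>
    <br>
    DESCRIPTION3
  </p>
</div>
HTML

# recover => 2 lets libxml2 parse the fragment leniently without printing warnings.
my $doc = XML::LibXML->load_html(string => $html, recover => 2);

# Hypothetical helper: text content of each node with surrounding whitespace removed.
sub trimmed {
    return map { my $t = $_->textContent; $t =~ s/^\s+|\s+$//g; $t } @_;
}

my @names = trimmed($doc->findnodes('//div//strong'));

# [1] restricts the step to the first text sibling after each <br>.
my @descs = trimmed($doc->findnodes('//div//br/following-sibling::text()[1]'));

# Pair titles and descriptions by position.
my @test = map { +{ name => $names[$_], desc => $descs[$_] } } 0 .. $#names;

printf "%s => %s\n", $_->{name}, $_->{desc} for @test;
```

The same expression also applies to the first file, where each `<br>` has only one following text sibling inside its own `<p>`.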

When you plug various versions of these into Web::Scraper, they don't work.

 process '//div', 'test[]' => scraper {
    process '//strong', 'name' => 'TEXT';
    process '//br/following-sibling::text()', 'desc' => 'TEXT';
  };

Result:

/tmp >for f in a b; do perl bs.pl file:///tmp/$f.html; done
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }

process '//div', 'test[]' => scraper {
  process '//div//strong', 'name' => 'TEXT';
  process '//div//br/following-sibling::text()', 'desc' => 'TEXT';
};

Result:

/tmp >for f in a b; do perl bs.pl file:///tmp/$f.html; done
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }

Even the most basic case:

  process 'div', 'test[]' => scraper {
    process 'strong', 'name' => 'TEXT';
  };

Result:

/tmp >for f in a b; do perl bs.pl file:///tmp/$f.html; done
{ test => [{ name => "TITLE1" }] }
{ test => [{ name => "TITLE1" }] }

Even if you tell it to use libxml2 with use Web::Scraper::LibXML: nothing!

To make sure I wasn't going crazy, I tried Ruby's Nokogiri:

 /tmp >for f in a b; do ruby -rnokogiri -rpp -e'pp Nokogiri::HTML(File.read(ARGV[0])).css("div p strong").map &:text' $f.html; done
["TITLE1", "TITLE2", "TITLE3"]
["TITLE1", "TITLE2", "TITLE3"]

What am I missing?
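One likely explanation for the single-element results: in Web::Scraper, a key without a trailing `[]` keeps only the first match, and the outer `'test[]'` rule matches the one `<div>`, so each attempt yields one hash holding the first `<strong>`. A minimal sketch, assuming Web::Scraper is installed (the `names` key and the shortened HTML are illustrative, not from the answer):

```perl
use strict;
use warnings;
use Web::Scraper;

# Shortened version of the single-<p> layout from the question.
my $html = <<'HTML';
<div>
  <p>
    <strong>TITLE1</strong>
    <br>
    DESCRIPTION1
    <strong>TITLE2</strong>
    <br>
    DESCRIPTION2
  </p>
</div>
HTML

my $titles = scraper {
    # 'names[]' (with brackets) asks for every match as an array reference;
    # a plain 'name' key would keep only the first <strong>.
    process 'div strong', 'names[]' => 'TEXT';
};

# scrape() accepts a reference to a string of HTML.
my $res = $titles->scrape(\$html);

print "$_\n" for @{ $res->{names} || [] };
```

This only collects the titles; pairing each one with its description still needs a second rule or a post-processing step.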


0 votes
I think I've solved it. I'm not sure whether this is the best approach, but it seems to handle both cases.

my $test = scraper {
    process '//div', 'test' => scraper {
        process '//div//strong//text()', 'name[]' => 'TEXT';
        process '//p/text()', 'desc[]' => [ 'TEXT', sub { s/^\s+|\s+$//g } ];
    };
};

my $res = $test->scrape(\$html);

# get the names and descriptions
my @keys   = @{ $res->{test}->{name} };
my @values = @{ $res->{test}->{desc} };

# merge two arrays into hash
my %hash;
@hash{@keys} = @values;
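A hash loses the original order and collapses duplicate titles, and the question asked for an array of `{ name, desc }` hashes. If that shape is needed, the two arrays could instead be paired by position. A minimal sketch in core Perl; the `pair_up` helper and the hard-coded `$res` stand-in are hypothetical and only mimic the structure the scraper above is expected to return:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the $res structure returned by the scraper.
my $res = {
    test => {
        name => [ 'TITLE1', 'TITLE2', 'TITLE3' ],
        desc => [ 'DESCRIPTION1', 'DESCRIPTION2', 'DESCRIPTION3' ],
    },
};

# Pair names and descriptions by index, stopping at the shorter list.
sub pair_up {
    my ($names, $descs) = @_;
    my $last = $#{$names} < $#{$descs} ? $#{$names} : $#{$descs};
    return [ map { +{ name => $names->[$_], desc => $descs->[$_] } } 0 .. $last ];
}

my $test = pair_up($res->{test}{name}, $res->{test}{desc});

printf "%s: %s\n", $_->{name}, $_->{desc} for @$test;
```

Stopping at the shorter list avoids undefined entries when the two rules match a different number of nodes, though a mismatch usually means the selectors need another look.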
    