How to use the number of rounds in Nutch 2.x


I have the same problem. I use only this single command for the whole process:

crawl urls/ucuzcumSeed.txt ucuzcum http://localhost:8983/solr/ucuzcum/ 10

crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
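For context, the bin/crawl script in Nutch 2.x simply repeats a generate → fetch → parse → updatedb cycle <numberOfRounds> times after a one-time seed injection. A rough sketch of the equivalent manual commands (the -topN value is illustrative; check the bin/crawl script in your own install for the exact flags it passes):

bin/nutch inject urls/ucuzcumSeed.txt -crawlId ucuzcum    # once, before the rounds
# each round then runs:
bin/nutch generate -topN 50000 -crawlId ucuzcum
bin/nutch fetch -all -crawlId ucuzcum
bin/nutch parse -all -crawlId ucuzcum
bin/nutch updatedb -all -crawlId ucuzcum
bin/nutch solrindex http://localhost:8983/solr/ucuzcum/ -all -crawlId ucuzcum

So if round 2 generates nothing, it means updatedb added no new unfetched URLs after round 1, not that the rounds argument was ignored.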

By the way, I am using Nutch 2.3.1 and Solr 5.2.1. The problem is that this command alone does not fetch the whole website; the numberOfRounds parameter seems to have no effect. On the first run, Nutch finds only one URL to fetch, then generates and parses it; only after that step could it pick up more URLs. In other words, Nutch stops at the end of the first iteration, even though my command tells it to keep going. What can I do to crawl an entire website with Nutch?

nutch-site.xml:

<property>
        <name>http.agent.name</name>
        <value>MerveCrawler</value>
    </property>

 <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.hbase.store.HBaseStore</value>
        <description>Default class for storing data</description>
    </property>

 <property>
        <name>plugin.includes</name>
        <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-rege$</value><!-- value cut off in the original post -->
    </property>

<property>
    <name>http.content.limit</name>
    <value>-1</value><!-- No limit -->
    <description>The length limit for downloaded content using the http://
      protocol, in bytes. If this value is nonnegative (>=0), content longer
      than it will be truncated; otherwise, no truncation at all. Do not
      confuse this setting with the file.content.limit setting.
    </description>
  </property>
<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100000000000000000000000000000000000000000000</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>10</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server. Note that this might get
   overriden by a Crawl-Delay from a robots.txt and is used ONLY if
   fetcher.threads.per.queue is set to 1.
   </description>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file
   protocol, in bytes. If this value is nonnegative (>=0), content longer
   than it will be truncated; otherwise, no truncation at all. Do not
   confuse this setting with the http.content.limit setting.
  </description>
</property>

<property>
  <name>http.timeout</name>
  <value>100000000000000000000000000000000000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>generate.max.count</name>
  <value>100000000</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>
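Two things worth checking alongside this config. First, the plugin.includes value above is cut off in the post (the trailing $), so make sure the full value actually loads a parse plugin such as parse-html or parse-tika: without a parser Nutch extracts no outlinks, and the crawl can never grow past the seed. Second, urlfilter-regex means every outlink must pass conf/regex-urlfilter.txt, and an overly strict pattern there silently drops them all. A minimal sketch of that file (www.example.com is a placeholder for the real host; the stock file ends with "+.", which accepts everything):

# conf/regex-urlfilter.txt (sketch)
# skip common binary/static resources
-\.(gif|jpg|png|ico|css|js|pdf|zip|gz)$
# accept everything on the target host
+^http://www\.example\.com/
# reject everything else
-.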
solr web-crawler nutch
1 Answer

There are several reasons why a crawl may not get any further, e.g. robots.txt directives. Check the logs and/or the contents of the crawl table to get a better idea of what is going wrong.
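A sketch of both checks on a stock 2.x install (the log path and grep patterns are assumptions; dumping the webpage table with readdb is the standard 2.x approach):

# watch the crawl log for robots denials, filter rejections and fetch errors
tail -f logs/hadoop.log | grep -iE 'robot|denied|rejected|error'
# dump the crawl (webpage) table for this crawl id and inspect URL statuses
bin/nutch readdb -dump crawldump -crawlId ucuzcum
less crawldump/part-r-00000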
