I have the same problem. I use only this one command for the whole process:
crawl urls/ucuzcumSeed.txt ucuzcum http://localhost:8983/solr/ucuzcum/ 10
crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
By the way, I am using Nutch 2.3.1 and Solr 5.2.1. The problem is that I cannot crawl the whole site with just this command; the numberOfRounds parameter does not seem to have any effect. On the first run Nutch finds only one URL to fetch, then generates and parses it. Only after a second round could it pick up more URLs, but in my case Nutch stops at the end of the first iteration even though the command tells it to continue. What can I do to crawl the entire website with Nutch?
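For reference, the bin/crawl script basically repeats one generate/fetch/parse/updatedb cycle per round. Below is a rough sketch of a single round run by hand on a Nutch 2.x install (the -crawlId/-all/-topN options follow the 2.x command line; exact flags may differ slightly between versions), which makes it easier to see after which step no new URLs appear:

# inject the seed list once, using the same crawl id as above
bin/nutch inject urls/ucuzcumSeed.txt -crawlId ucuzcum

# one crawl round: build a fetch list, fetch it, parse it, update the web table
bin/nutch generate -topN 50000 -crawlId ucuzcum
bin/nutch fetch -all -crawlId ucuzcum
bin/nutch parse -all -crawlId ucuzcum
bin/nutch updatedb -all -crawlId ucuzcum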
nutch-site.xml:
<property>
<name>http.agent.name</name>
<value>MerveCrawler</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-rege$
</property>
<property>
<name>http.content.limit</name>
<value>-1</value><!-- No limit -->
<description>The length limit for downloaded content using the http://
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
<property>
<name>fetcher.verbose</name>
<value>true</value>
<description>If true, fetcher will log more verbosely.</description>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>100000000000000000000000000000000000000000000</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
<property>
<name>db.ignore.external.links</name>
<value>false</value>
<description>If true, outlinks leading from a page to external hosts
will be ignored. This is an effective way to limit the crawl to include
only initially injected hosts, without creating complex URLFilters.
</description>
</property>
<property>
<name>db.ignore.internal.links</name>
<value>false</value>
<description>If true, when adding new links to a page, links from
the same host are ignored. This is an effective way to limit the
size of the link database, keeping only the highest quality
links.
</description>
</property>
<property>
<name>fetcher.server.delay</name>
<value>10</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server. Note that this might get
overriden by a Crawl-Delay from a robots.txt and is used ONLY if
fetcher.threads.per.queue is set to 1.
</description>
</property>
<property>
<name>file.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content using the file
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the http.content.limit setting.
</description>
</property>
<property>
<name>http.timeout</name>
<value>100000000000000000000000000000000000</value>
<description>The default network timeout, in milliseconds.</description>
</property>
<property>
<name>generate.max.count</name>
<value>100000000</value>
<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generator.count.mode.
</description>
</property>
There can be several reasons why the crawl does not go any further, e.g. robots.txt directives. Look at the logs and/or the contents of the crawl table to get a better idea of where the problem lies.
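For example, with the HBaseStore configured above, the crawl data ends up in a table named <crawlId>_webpage (here ucuzcum_webpage). A quick way to inspect it and the robots.txt, assuming a standard Nutch 2.x + HBase setup (the hostname in the curl line is only a placeholder for the actual seed host):

# check whether the seed host's robots.txt is blocking the crawl
curl http://www.example.com/robots.txt

# per-status counts of what Nutch has stored so far
bin/nutch readdb -stats -crawlId ucuzcum

# or dump the whole web table to a local directory and read it
bin/nutch readdb -dump crawldump -crawlId ucuzcum

# or look directly at the HBase table behind it
echo "scan 'ucuzcum_webpage', {LIMIT => 10}" | hbase shell

If only the seed URL shows up in the table after the first round, its outlinks were never added, which usually points at parsing, URL filters, or robots.txt.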