Description
The inline documentation of 'conf/crawl-tool.xml' mentions:
<!-- Do not modify this file directly. Instead, copy entries that you --> <!-- wish to modify from this file into nutch-site.xml and change them --> <!-- there. If nutch-site.xml does not already exist, create it. -->
However, I don't see any way of overriding the properties defined in 'conf/crawl-tool.xml' as 'conf/nutch-site.xml' is added to the configuration before 'conf/crawl-tool.xml' in the code. Here are the relevant code snippets:
src/org/apache/nutch/crawl/Crawl.java:
Configuration conf = NutchConfiguration.create(); conf.addResource("crawl-tool.xml"); JobConf job = new NutchJob(conf);
src/org/apache/nutch/tool/NutchConfiguration.java:
conf.addResource("nutch-default.xml"); conf.addResource("nutch-site.xml");
I have fixed this in the attached patch. 'crawl-tool.xml' is now added to the configuration before 'nutch-site.xml' only if crawl is invoked using the 'bin/nutch crawl' command.