NUTCH-735

crawl-tool.xml must be read before nutch-site.xml when invoked using crawl command

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.1
    • Component/s: web gui
    • Labels: None
    • Patch Info: Patch Available

    Description

      The inline documentation of 'conf/crawl-tool.xml' mentions:

      <!-- Do not modify this file directly.  Instead, copy entries that you -->
      <!-- wish to modify from this file into nutch-site.xml and change them -->
      <!-- there.  If nutch-site.xml does not already exist, create it.      -->
      

      However, I don't see any way of overriding the properties defined in 'conf/crawl-tool.xml' from 'conf/nutch-site.xml': 'conf/nutch-site.xml' is added to the configuration before 'conf/crawl-tool.xml' in the code, and a resource added later takes precedence over one added earlier. Here are the relevant code snippets:

      src/org/apache/nutch/crawl/Crawl.java:

      Configuration conf = NutchConfiguration.create();
      conf.addResource("crawl-tool.xml");
      JobConf job = new NutchJob(conf);
      

      src/org/apache/nutch/util/NutchConfiguration.java:

      conf.addResource("nutch-default.xml");
      conf.addResource("nutch-site.xml");
      

      I have fixed this in the attached patch. 'crawl-tool.xml' is now added to the configuration before 'nutch-site.xml' only if crawl is invoked using the 'bin/nutch crawl' command.
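
      To illustrate why the order matters, here is a minimal, self-contained sketch. It does not use the Hadoop Configuration class or reproduce the attached patch; it only models the rule that a resource added later overrides one added earlier (assuming the property is not marked final). The class name, the property key, and the three values are illustrative assumptions, not taken from the actual configuration files.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a layered configuration: resources are applied in
// order, and a later resource overwrites values set by an earlier one.
public class ResourceOrderDemo {

    // Illustrative contents only; the key and values are assumptions.
    static final String KEY = "urlfilter.regex.file";
    static final Map<String, String> NUTCH_DEFAULT = Map.of(KEY, "regex-urlfilter.txt");
    static final Map<String, String> NUTCH_SITE = Map.of(KEY, "my-urlfilter.txt");
    static final Map<String, String> CRAWL_TOOL = Map.of(KEY, "crawl-urlfilter.txt");

    // Merge the resources in the order given and look up one key.
    static String resolve(List<Map<String, String>> resources, String key) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> resource : resources) {
            merged.putAll(resource); // entries from later resources win
        }
        return merged.get(key);
    }

    // Order described in this report: nutch-site.xml is added before crawl-tool.xml.
    static String reportedOrder() {
        return resolve(List.of(NUTCH_DEFAULT, NUTCH_SITE, CRAWL_TOOL), KEY);
    }

    // Order after the fix: crawl-tool.xml is added before nutch-site.xml.
    static String patchedOrder() {
        return resolve(List.of(NUTCH_DEFAULT, CRAWL_TOOL, NUTCH_SITE), KEY);
    }

    public static void main(String[] args) {
        System.out.println("reported order: " + reportedOrder());
        System.out.println("patched order:  " + patchedOrder());
    }
}
```

      In the reported order the value from the crawl-tool layer is the one that survives, so an entry copied into nutch-site.xml has no effect. In the patched order the nutch-site layer is applied last and its value is the one returned.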

      Attachments

        1. NUTCH-735v0.1.patch
          2 kB
          Susam Pal

          People

            Assignee: Dogacan Guney
            Reporter: Susam Pal