Nutch / NUTCH-735

crawl-tool.xml must be read before nutch-site.xml when invoked using crawl command

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.1
    • Component/s: web gui
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The inline documentation of 'conf/crawl-tool.xml' mentions:

      <!-- Do not modify this file directly.  Instead, copy entries that you -->
      <!-- wish to modify from this file into nutch-site.xml and change them -->
      <!-- there.  If nutch-site.xml does not already exist, create it.      -->
      

      However, I don't see any way of overriding the properties defined in 'conf/crawl-tool.xml' from 'conf/nutch-site.xml': the code adds 'conf/nutch-site.xml' to the configuration before 'conf/crawl-tool.xml', and values from a resource added later override those from resources added earlier. Here are the relevant code snippets:

      src/org/apache/nutch/crawl/Crawl.java:

      Configuration conf = NutchConfiguration.create();
      conf.addResource("crawl-tool.xml");
      JobConf job = new NutchJob(conf);
      

      src/org/apache/nutch/util/NutchConfiguration.java:

      conf.addResource("nutch-default.xml");
      conf.addResource("nutch-site.xml");
      

      I have fixed this in the attached patch. 'crawl-tool.xml' is now added to the configuration before 'nutch-site.xml' only if crawl is invoked using the 'bin/nutch crawl' command.
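
The effect of the load order can be illustrated with a minimal sketch. It does not use Hadoop's Configuration class or the code from the patch; it only imitates the "later resource wins" behavior with plain maps. The class name ResourceOrderSketch, the key "example.key", and its values are hypothetical placeholders chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ResourceOrderSketch {

    // Merges resources in the order given. A value from a later resource
    // replaces the value from an earlier one. This imitates the override
    // behavior only; it is not the real Configuration class.
    static Map<String, String> load(List<Map<String, String>> resources) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> resource : resources) {
            merged.putAll(resource);
        }
        return merged;
    }

    // Hypothetical stand-ins for the three files; "example.key" is not a real property.
    static final Map<String, String> NUTCH_DEFAULT = Map.of("example.key", "default");
    static final Map<String, String> CRAWL_TOOL = Map.of("example.key", "crawl-tool");
    static final Map<String, String> NUTCH_SITE = Map.of("example.key", "site");

    // Order reported in this issue: nutch-site.xml is added before crawl-tool.xml.
    static String beforePatch() {
        return load(List.of(NUTCH_DEFAULT, NUTCH_SITE, CRAWL_TOOL)).get("example.key");
    }

    // Order described for the patch: crawl-tool.xml is added before nutch-site.xml.
    static String afterPatch() {
        return load(List.of(NUTCH_DEFAULT, CRAWL_TOOL, NUTCH_SITE)).get("example.key");
    }

    public static void main(String[] args) {
        System.out.println("before patch: " + beforePatch()); // crawl-tool
        System.out.println("after patch:  " + afterPatch());  // site
    }
}
```

With the original order the value from the crawl-tool stand-in survives, so the site-level value has no effect. With the patched order the site-level value is applied last and wins.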

        Attachments

        1. NUTCH-735v0.1.patch
          2 kB
          Susam Pal

            People

            • Assignee:
              dogacan Dogacan Guney
              Reporter:
              susam Susam Pal