Nutch / NUTCH-1519

Configuration Overrides not in sync between WebTableReader and nutch-default.xml

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Auto Closed
    • Affects Version/s: 2.1
    • Fix Version/s: 2.5
    • Component/s: crawldb, storage
    • Labels: None

    Description

      In 2.x HEAD, the WebTableReader class [0] sets configuration overrides for properties such as

      currentJob.getConfiguration().setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
      currentJob.getConfiguration().setBoolean("db.reader.stats.sort", sort);
      

      as well as

      Configuration cfg = job.getConfiguration();
      cfg.set(WebTableRegexMapper.regexParamName, regex);
      cfg.setBoolean(WebTableRegexMapper.contentParamName, content);
      cfg.setBoolean(WebTableRegexMapper.headersParamName, headers);
      cfg.setBoolean(WebTableRegexMapper.linksParamName, links);
      cfg.setBoolean(WebTableRegexMapper.textParamName, text);
      

      None of these properties are actually present in nutch-default.xml, so they are neither documented nor configurable there, even though the code overrides them.
      This should be sorted out.
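
      One way to spot this kind of drift is to compare the keys set in code against the keys declared in the defaults file. The following is a minimal sketch only: it uses java.util.Properties as a stand-in for the Hadoop Configuration loaded from nutch-default.xml, and the class name OverrideSyncCheck, the method missingFromDefaults, and the key example.declared.property are all hypothetical names introduced for illustration. The two reported keys are copied from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

public class OverrideSyncCheck {

    /**
     * Returns, in sorted order, every override key that has no entry
     * in the given defaults.
     */
    public static List<String> missingFromDefaults(Properties defaults, Set<String> overrideKeys) {
        List<String> missing = new ArrayList<>();
        // TreeSet iteration gives a stable, alphabetical report order.
        for (String key : new TreeSet<>(overrideKeys)) {
            if (defaults.getProperty(key) == null) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Stand-in for nutch-default.xml; this key is hypothetical.
        Properties defaults = new Properties();
        defaults.setProperty("example.declared.property", "true");

        // Keys set programmatically; the last two come from the description.
        Set<String> overrides = new TreeSet<>();
        overrides.add("example.declared.property");
        overrides.add("mapreduce.fileoutputcommitter.marksuccessfuljobs");
        overrides.add("db.reader.stats.sort");

        for (String key : missingFromDefaults(defaults, overrides)) {
            System.out.println("Not declared in defaults: " + key);
        }
    }
}
```

      In a real check the defaults would be read from nutch-default.xml rather than built by hand, and the override keys would need to include the values of the WebTableRegexMapper constants, which are not shown in this report.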

      [0] http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java

          People

            Assignee: Unassigned
            Reporter: Lewis John McGibbney (lewismc)
            Votes: 0
            Watchers: 2
