NUTCH-1519

Configuration Overrides not in sync between WebTableReader and nutch-default.xml

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Auto Closed
    • Affects Version/s: 2.1
    • Fix Version/s: 2.5
    • Component/s: crawldb, storage
    • Labels:
      None

      Description

      In 2.x HEAD, the WebTableReader class [0] sets overrides for properties such as

      currentJob.getConfiguration().setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
      currentJob.getConfiguration().setBoolean("db.reader.stats.sort", sort);
      

      as well as

      Configuration cfg = job.getConfiguration();
      cfg.set(WebTableRegexMapper.regexParamName, regex);
      cfg.setBoolean(WebTableRegexMapper.contentParamName, content);
      cfg.setBoolean(WebTableRegexMapper.headersParamName, headers);
      cfg.setBoolean(WebTableRegexMapper.linksParamName, links);
      cfg.setBoolean(WebTableRegexMapper.textParamName, text);
      

      None of these properties are actually present in nutch-default.xml, so they are neither documented there nor configurable or overridable from it.
      This should be sorted out.
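
      As a rough illustration of what bringing the two in sync might look like, the fragment below sketches a possible nutch-default.xml entry for one of the properties named above. Only the property name db.reader.stats.sort comes from the code; the default value and the description text are assumptions made for this sketch, not taken from the Nutch source.

      ```xml
      <!-- Hypothetical entry: the value and description are assumed, not from Nutch -->
      <property>
        <name>db.reader.stats.sort</name>
        <value>false</value>
        <description>Assumed wording: whether WebTableReader sorts its
        statistics output. WebTableReader currently sets this value
        programmatically on the job configuration.
        </description>
      </property>
      ```

      Because WebTableReader calls setBoolean on the job configuration, a value declared this way would still be replaced at run time unless the reader were also changed to consult the configured value first.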

      [0] http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java

            People

            • Assignee: Unassigned
            • Reporter: Lewis John McGibbney (lewismc)