Nutch / NUTCH-2375

Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: deployment
    • Labels:
      None

      Description

Nutch still uses the deprecated org.apache.hadoop.mapred API. It needs to be upgraded to the org.apache.hadoop.mapreduce API.
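For context, a minimal sketch of the API shift (hypothetical class, not from the Nutch codebase): the old org.apache.hadoop.mapred API configures jobs through JobConf and JobClient.runJob and writes output through OutputCollector/Reporter, while the new org.apache.hadoop.mapreduce API uses Job, Mapper.Context, and job.waitForCompletion. This sketch assumes Hadoop on the classpath:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewApiSketch {

  // New API: extend org.apache.hadoop.mapreduce.Mapper and emit records
  // through Context instead of OutputCollector/Reporter.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Old API equivalent: JobConf job = new JobConf(conf); JobClient.runJob(job);
    Job job = Job.getInstance(conf, "pass-through");
    job.setJarByClass(NewApiSketch.class);
    job.setMapperClass(PassThroughMapper.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```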


          Activity

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 opened a new pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188

• This PR is part of the upgrade and will be updated continuously by me.
          • Please feel free to review the PR.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-296388783

Excellent @Omkar20895, thank you for starting this pull request. Now you will see that EVERY class is broken, i.e. does not compile... let's gradually fix those classes. Please update this PR as you progress. Thank you for submitting the PR early; it makes a huge difference for review.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-297560916

          @Omkar20895 please keep the updates coming as we can review the code incrementally. Thanks.

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-297653471

@lewismc I am working on upgrading CrawlDb; I will send the update as soon as I am done with it. Thanks.

          omkar20895 Omkar Reddy added a comment -

          Hello dev@,

I am using the following URL to upgrade the codebase: https://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api. Please post on this thread if there is any discrepancy in the slides at the above link.

          Thanks,
          Omkar.

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-298213968

CrawlDb needs further updates, which will be done in the coming weeks. Major changes still need to be made in DeduplicationJob.java. Thanks.
@lewismc

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-300236434

          @lewismc please review the latest commit so that I can incorporate the changes suggested by you.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r116651511

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDbReader.java
          ##########
          @@ -47,22 +47,10 @@
          import org.apache.hadoop.io.SequenceFile;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.io.Writable;
          -import org.apache.hadoop.mapred.FileInputFormat;
          -import org.apache.hadoop.mapred.FileOutputFormat;
          -import org.apache.hadoop.mapred.JobClient;
          -import org.apache.hadoop.mapred.JobConf;
          -import org.apache.hadoop.mapred.MapFileOutputFormat;
          -import org.apache.hadoop.mapred.Mapper;
          -import org.apache.hadoop.mapred.OutputCollector;
          -import org.apache.hadoop.mapred.RecordWriter;
          -import org.apache.hadoop.mapred.Reducer;
          -import org.apache.hadoop.mapred.Reporter;
          -import org.apache.hadoop.mapred.SequenceFileInputFormat;
          -import org.apache.hadoop.mapred.SequenceFileOutputFormat;
          -import org.apache.hadoop.mapred.TextOutputFormat;
          -import org.apache.hadoop.mapred.lib.HashPartitioner;
          -import org.apache.hadoop.mapred.lib.IdentityMapper;
          -import org.apache.hadoop.mapred.lib.IdentityReducer;
          +import org.apache.hadoop.mapreduce.*;

          Review comment:
          Same here.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r116651544

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDbReader.java
          ##########
          @@ -371,26 +361,27 @@ public void close() {
          private TreeMap<String, LongWritable> processStatJobHelper(String crawlDb, Configuration config, boolean sort) throws IOException{
          Path tmpFolder = new Path(crawlDb, "stat_tmp" + System.currentTimeMillis());

- JobConf job = new NutchJob(config);
+ Job job = new NutchJob(config);
+ config = job.getConfiguration();

          Review comment:
          Formatting.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r116651333

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDb.java
          ##########
          @@ -28,8 +28,9 @@
          import org.apache.hadoop.io.*;
          import org.apache.hadoop.fs.*;
          import org.apache.hadoop.conf.*;
          -import org.apache.hadoop.mapred.*;
          import org.apache.hadoop.mapreduce.Job;
          +import org.apache.hadoop.mapreduce.lib.input.*;

          Review comment:
Please never use wildcard imports, e.g. ```import org.apache.hadoop.mapreduce.lib.input.*;```; always make individual imports.
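For instance, the wildcard line above would be expanded into explicit imports such as the following (which classes are actually needed depends on what CrawlDb.java uses; these two are illustrative):

```java
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
```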

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r116651673

          ##########
          File path: src/java/org/apache/nutch/crawl/Generator.java
          ##########
          @@ -29,8 +29,10 @@
          import org.apache.commons.jexl2.Expression;
          import org.apache.hadoop.io.*;
          import org.apache.hadoop.conf.*;
          -import org.apache.hadoop.mapred.*;
          -import org.apache.hadoop.mapred.lib.MultipleSequenceFileOutputFormat;
          +import org.apache.hadoop.mapreduce.*;

          Review comment:
          Same here.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r116651780

          ##########
          File path: src/java/org/apache/nutch/util/NutchJob.java
          ##########
          @@ -17,14 +17,16 @@

          package org.apache.nutch.util;

          +import java.io.IOException;
          +
          import org.apache.hadoop.conf.Configuration;
          -import org.apache.hadoop.mapred.JobConf;
          +import org.apache.hadoop.mapreduce.Job;

-/** A {@link JobConf} for Nutch jobs. */
-public class NutchJob extends JobConf {
+/** A {@link Job} for Nutch jobs. */
+public class NutchJob extends Job {

-  public NutchJob(Configuration conf) {
-    super(conf, NutchJob.class);
+  public NutchJob(Configuration conf) throws IOException {
+    super(conf, "NutchJob");

          Review comment:
          Does this mean that every Job will be named ```"NutchJob"```?

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r116651456

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDbMerger.java
          ##########
          @@ -32,7 +32,9 @@
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.io.Writable;
          -import org.apache.hadoop.mapred.*;
          +import org.apache.hadoop.mapreduce.*;

          Review comment:
          Same here.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r116651646

          ##########
          File path: src/java/org/apache/nutch/crawl/DeduplicationJob.java
          ##########
          @@ -31,7 +31,7 @@
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.BytesWritable;
          import org.apache.hadoop.io.Text;
          -import org.apache.hadoop.mapred.Counters.Group;
          +/*import org.apache.hadoop.mapred.Counters.Group;

          Review comment:
          Never leave code commented out like this please. Also make explicit individual imports please.

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r116693140

          ##########
          File path: src/java/org/apache/nutch/util/NutchJob.java
          ##########
          @@ -17,14 +17,16 @@

          package org.apache.nutch.util;

          +import java.io.IOException;
          +
          import org.apache.hadoop.conf.Configuration;
          -import org.apache.hadoop.mapred.JobConf;
          +import org.apache.hadoop.mapreduce.Job;

-/** A {@link JobConf} for Nutch jobs. */
-public class NutchJob extends JobConf {
+/** A {@link Job} for Nutch jobs. */
+public class NutchJob extends Job {

-  public NutchJob(Configuration conf) {
-    super(conf, NutchJob.class);
+  public NutchJob(Configuration conf) throws IOException {
+    super(conf, "NutchJob");

          Review comment:
Yes, @lewismc. If we create a job as NutchJob job = new NutchJob(conf); it will be named "NutchJob" by default, and that can be overridden with job.setJobName("<name>"); in the code where the job is created. Should it be done any other way?
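A sketch of that usage pattern, assuming the NutchJob constructor above and Hadoop on the classpath (the call site and job name string are hypothetical, not taken from Nutch):

```java
// Hypothetical call site illustrating the default name and the override.
Configuration conf = NutchConfiguration.create();
Job job = new NutchJob(conf);        // job name defaults to "NutchJob"
job.setJobName("crawldb update");    // each caller overrides with its own name
```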

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r116693310

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDb.java
          ##########
          @@ -28,8 +28,9 @@
          import org.apache.hadoop.io.*;
          import org.apache.hadoop.fs.*;
          import org.apache.hadoop.conf.*;
          -import org.apache.hadoop.mapred.*;
          import org.apache.hadoop.mapreduce.Job;
          +import org.apache.hadoop.mapreduce.lib.input.*;

          Review comment:
          I am changing it in the local copy and I will update the PR today.

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-302783499

          Please ignore the number of commits, I will squash them while this PR is being merged. Thanks.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-302793054

          np @Omkar20895 keep them coming

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-302793286

          Also @Omkar20895, please ensure that this PR is kept up-to-date with the master branch or else we will end up in a mess. Thank you

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-302852374

          yes I will @lewismc

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-304350642

          @lewismc please review the latest commit. The errors I mentioned can be found in the files crawl/Generator.java, crawl/CrawlDbReader.java, and crawl/DeduplicationJob.java

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r118763652

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDb.java
          ##########
          @@ -28,8 +28,9 @@
          import org.apache.hadoop.io.*;
          import org.apache.hadoop.fs.*;
          import org.apache.hadoop.conf.*;
          -import org.apache.hadoop.mapred.*;
          import org.apache.hadoop.mapreduce.Job;
          +import org.apache.hadoop.mapreduce.lib.input.*;

          Review comment:
          Can you please update this as well?

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-304371425

          @Omkar20895 I have no idea what the errors are as I am getting a whole host of compile issues due to the changing nature of NutchJob with the upgrade.
          Please go ahead and upgrade all instances of
          ```
          JobConf job = new NutchJob(getConf());
          ```
          We can then take it job-by-job, thanks. Also please make the updates as per my previous review and comments. Let's keep this branch up-to-date with master and also up-to-date with my comments. Thank you very much @Omkar20895
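          A minimal sketch of the upgrade being requested here (class and job names are illustrative, not taken from the PR): the old `JobConf`/`JobClient` pattern becomes a `mapreduce.Job` obtained via the static factory, with job submission through `waitForCompletion`.

          ```java
          import java.io.IOException;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.mapreduce.Job;

          public class JobUpgradeSketch {
            // Old mapred style:
            //   JobConf job = new NutchJob(getConf());
            //   JobClient.runJob(job);
            // New mapreduce style: the Job wraps the Configuration, and
            // JobClient.runJob is replaced by job.waitForCompletion(true).
            public static int run(Configuration conf) throws IOException,
                InterruptedException, ClassNotFoundException {
              Job job = Job.getInstance(conf, "crawldb update"); // job name set at creation
              // mapper/reducer/input/output setup goes here, using the
              // org.apache.hadoop.mapreduce.lib.* input and output format classes
              return job.waitForCompletion(true) ? 0 : 1;
            }
          }
          ```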

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-304371641

          @Omkar20895 you will also see the following
          ```
          /usr/local/nutch/src/java/org/apache/nutch/util/NutchJob.java:29: warning: [deprecation] Job(Configuration,String) in Job has been deprecated
          [javac] super(conf, "NutchJob");
          ```
          This also needs to be fixed before you go on and do anything else.
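          One way to silence that deprecation warning (a sketch; the PR's actual fix for NutchJob may differ) is to stop calling the deprecated `Job(Configuration, String)` constructor and create jobs through the supported static factory `Job.getInstance(conf, name)` instead, e.g. from a factory method rather than a subclass constructor:

          ```java
          import java.io.IOException;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.mapreduce.Job;

          public class NutchJobSketch {
            // Job's (Configuration, String) constructor is deprecated in favour
            // of the static factory: Job.getInstance(conf, jobName) is the
            // supported way to create a named Job in org.apache.hadoop.mapreduce.
            public static Job getInstance(Configuration conf) throws IOException {
              return Job.getInstance(conf, "NutchJob"); // default name, overridable via setJobName
            }
          }
          ```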

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r121465277

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDb.java
          ##########
          @@ -28,8 +28,12 @@
          import org.apache.hadoop.io.*;

          Review comment:
          @Omkar20895 can you please sort out all instances of wildcard imports for Hadoop?

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r121465424

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDbMerger.java
          ##########
          @@ -32,7 +32,9 @@
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.io.Writable;
          -import org.apache.hadoop.mapred.*;
          +import org.apache.hadoop.mapreduce.*;

          Review comment:
          @Omkar20895 please address the above issue

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-308030299

          Hi @lewismc, the major error is produced by SequenceFileOutputFormat.getReaders(), which has been removed from the latest mapreduce API; I cannot find a replacement for it. Thanks.
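          The removed mapred helper essentially opened one `SequenceFile.Reader` per file in the output directory, so an equivalent can be hand-rolled (a sketch, not necessarily what the PR ended up doing; the filtering of `_SUCCESS` and hidden files is simplified):

          ```java
          import java.io.IOException;
          import java.util.ArrayList;
          import java.util.List;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.SequenceFile;

          public class SequenceFileReaders {
            /** Open a reader for every part file under dir, mimicking the removed
             *  mapred SequenceFileOutputFormat.getReaders(conf, dir) helper. */
            public static SequenceFile.Reader[] getReaders(Configuration conf, Path dir)
                throws IOException {
              FileSystem fs = dir.getFileSystem(conf);
              List<SequenceFile.Reader> readers = new ArrayList<>();
              for (FileStatus stat : fs.listStatus(dir)) {
                Path p = stat.getPath();
                // skip _SUCCESS markers and hidden files, keep data part files
                if (stat.isFile() && !p.getName().startsWith("_")
                    && !p.getName().startsWith(".")) {
                  readers.add(new SequenceFile.Reader(conf, SequenceFile.Reader.file(p)));
                }
              }
              return readers.toArray(new SequenceFile.Reader[0]);
            }
          }
          ```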

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-308946744

          Hi @lewismc, did you get a chance to look at this issue? Thanks.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-308948137

          Yes I did @Omkar20895

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-308948181

          I'll comment on the PR

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-309027522

          Hi @Omkar20895 I am getting the following build errors
          ```
          compile-core:
          [javac] Compiling 284 source files to /usr/local/nutch/build/classes
          [javac] /usr/local/nutch/src/java/org/apache/nutch/crawl/CrawlDb.java:119: error: unreported exception InterruptedException; must be caught or declared to be thrown
          [javac] int complete = job.waitForCompletion(true)?0:1;
          [javac] ^
          [javac] /usr/local/nutch/src/java/org/apache/nutch/crawl/CrawlDbReader.java:399: error: cannot find symbol
          [javac] SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(config,
          [javac] ^
          [javac] symbol: method getReaders(Configuration,Path)
          [javac] location: class SequenceFileOutputFormat
          [javac] /usr/local/nutch/src/java/org/apache/nutch/hostdb/ReadHostDb.java:208: error: cannot find symbol
          [javac] SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(conf, hostDb);
          [javac] ^
          [javac] symbol: method getReaders(Configuration,Path)
          [javac] location: class SequenceFileOutputFormat
          [javac] /usr/local/nutch/src/java/org/apache/nutch/segment/SegmentReader.java:434: error: cannot find symbol
          [javac] SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(
          [javac] ^
          [javac] symbol: method getReaders(Configuration,Path)
          [javac] location: class SequenceFileOutputFormat
          [javac] /usr/local/nutch/src/java/org/apache/nutch/segment/SegmentReader.java:511: error: cannot find symbol
          [javac] SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(
          [javac] ^
          [javac] symbol: method getReaders(Configuration,Path)
          [javac] location: class SequenceFileOutputFormat
          [javac] /usr/local/nutch/src/java/org/apache/nutch/tools/FreeGenerator.java:197: error: cannot find symbol
          [javac] job.setOutputKeyComparatorClass(Generator.HashComparator.class);
          [javac] ^
          [javac] symbol: method setOutputKeyComparatorClass(Class<HashComparator>)
          [javac] location: variable job of type Job
          [javac] /usr/local/nutch/src/java/org/apache/nutch/util/NutchJob.java:29: warning: [deprecation] Job(Configuration,String) in Job has been deprecated
          [javac] super(conf, "NutchJob");
          [javac] ^
          [javac] Note: Some input files use unchecked or unsafe operations.
          [javac] Note: Recompile with -Xlint:unchecked for details.
          [javac] 6 errors
          [javac] 1 warning
          ```
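          Two of the errors above have mechanical fixes in the new API (a sketch; the job and comparator classes are illustrative, not from the PR): `waitForCompletion` declares `InterruptedException` and `ClassNotFoundException`, so callers must catch or declare them, and `JobConf.setOutputKeyComparatorClass` corresponds to `Job.setSortComparatorClass` in org.apache.hadoop.mapreduce:

          ```java
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Job;

          public class ErrorFixSketch {
            public static int run(Configuration conf) throws Exception {
              Job job = Job.getInstance(conf, "free generator");
              // mapred: job.setOutputKeyComparatorClass(Generator.HashComparator.class);
              // mapreduce equivalent (the comparator class here is illustrative):
              job.setSortComparatorClass(Text.Comparator.class);
              // waitForCompletion throws IOException, InterruptedException and
              // ClassNotFoundException, hence the throws clause above:
              return job.waitForCompletion(true) ? 0 : 1;
            }
          }
          ```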

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-309027722

          Please start with the following issue
          ```
          [javac] symbol: method setOutputKeyComparatorClass(Class<HashComparator>)
          [javac] location: variable job of type Job
          [javac] /usr/local/nutch/src/java/org/apache/nutch/util/NutchJob.java:29: warning: [deprecation] Job(Configuration,String) in Job has been deprecated
          [javac] super(conf, "NutchJob");
          [javac] ^
          ```

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-309028312

          @Omkar20895 please refer to the following class for guidance:
          https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/util/NutchJob.java
          Once you push a commit please ping me. Thank you.

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-309217381

          @lewismc please take a look at this commit.

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r122575484

          ##########
          File path: src/java/org/apache/nutch/util/NutchJob.java
          ##########
          ```
          @@ -25,8 +25,9 @@
           /** A {@link Job} for Nutch jobs. */
           public class NutchJob extends Job {

          -  public NutchJob(Configuration conf) throws IOException {
          -    super(conf, "NutchJob");
          -  }
          +  public static NutchJob getJobInstance(Configuration conf){
          ```

          Review comment:
          This latest addition will still give an error, because this class no longer declares a constructor: the compiler will add a default constructor containing an implicit super() call to the parent class, something of this form: `public NutchJob() { super(); }`.

          But this will not cause any discrepancy, because the default constructor is not used anywhere in the code base.
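          The compiler-generated default constructor can be demonstrated with a small, self-contained sketch. The names below are illustrative stand-ins (ParentJob plays the role of Hadoop's Job, ChildJob the role of NutchJob), not the real Nutch or Hadoop classes:

          ```java
          // Illustrative stand-ins for the discussion above, not the real Nutch code.
          class ParentJob {
            ParentJob() {}              // remove this no-arg constructor and ChildJob
            ParentJob(String name) {}   // fails to compile: the implicit super() has no target
          }

          class ChildJob extends ParentJob {
            // No constructor declared: the compiler inserts ChildJob() { super(); }.
            // The static factory mirrors the getJobInstance(...) approach in the PR.
            static ChildJob getJobInstance(String name) {
              return new ChildJob();
            }
          }

          public class DefaultCtorDemo {
            public static void main(String[] args) {
              System.out.println(ChildJob.getJobInstance("NutchJob") instanceof ParentJob); // prints true
            }
          }
          ```

          Commenting out the no-arg ParentJob constructor reproduces the compile error being discussed, since the generated default constructor in ChildJob has no matching super() to call.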

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r122575534

          ##########
          File path: src/java/org/apache/nutch/util/NutchJob.java
          ##########
          ```
          @@ -25,8 +25,9 @@
           /** A {@link Job} for Nutch jobs. */
           public class NutchJob extends Job {

          -  public NutchJob(Configuration conf) throws IOException {
          -    super(conf, "NutchJob");
          -  }
          +  public static NutchJob getJobInstance(Configuration conf){
          ```

          Review comment:
          This doesn't give an error but gives a warning.

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-309652343

          Hi, @lewismc any update on this?

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-309812412

          Hi @Omkar20895 please update all instances of ```SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(getConf(), dir);``` to ```MapFile.Reader[] readers = MapFileOutputFormat.getReaders(dir, getConf());```.
          Please push an update once this is done and we can take it from there. Thank you.

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-309813213

          But MapFileOutputFormat and SequenceFileOutputFormat are different, aren't they? Won't it cause some discrepancy?
          @lewismc

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-309815700

          > But mapfileoutputformat and sequencefileoutputformat are different aren't they?

          Yes, they have different semantics. You can see the difference between [MapFile](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/MapFile.html) and [SequenceFile](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/SequenceFile.html).

          > Won't it cause some discrepancy?

          It may well do. For now, though, you can implement them (there are only 4 or 5 instances), and we have this thread for reference. If there are discrepancies then we can come back and fix them.
          The most important thing is that you continue to move onwards.

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310011122

          Done! What about the other two errors? For one of them I cannot find a replacement:
          ```
          [javac] /home/omkar/Documents/GIT/nutch1/src/java/org/apache/nutch/tools/FreeGenerator.java:197: error: cannot find symbol
          [javac] job.setOutputKeyComparatorClass(Generator.HashComparator.class);
          ```

          @lewismc

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310097433

          @Omkar20895 before I look at this, which Class or Method is no longer found? ```setOutputKeyComparatorClass```, ```Generator```, or ```HashComparator```?

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310113233

          @lewismc setOutputKeyComparatorClass.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310138744

          @Omkar20895 did you push your new commits? I don't see them yet and I just tried pulling.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310139507

          Please replace ```job.setOutputKeyComparatorClass``` with ```job.setSortComparatorClass```

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310141065

          Also, the NutchJob class should look as follows:
          ```
          package org.apache.nutch.util;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.mapred.JobConf;

          /** A {@link JobConf} for Nutch jobs. */
          public class NutchJob extends JobConf {

            public NutchJob(Configuration conf) {
              super(conf, NutchJob.class);
            }

          }
          ```

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310144221

          `public class NutchJob extends JobConf`
          Okay, let me explain why I changed it: JobConf no longer exists in the new MapReduce API, so I replaced it with the Job class, and NutchJob now extends Job.
          ```
          public NutchJob(Configuration conf) {
            super(conf, NutchJob.class);
          }
          ```
          If you look at the constructor, its super call invokes Job(conf, class), which is deprecated; the documentation suggests using getInstance instead.

          Please refer to constructor summary section [here](https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapreduce/Job.html).

          Please feel free to correct me if I am wrong. Thanks.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310156248

          OK, as long as the deprecation warning is removed at some stage, I am happy with that. Looking forward to finishing this upgrade so we can test it out. Thank you @Omkar20895

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310279028

          Hi, there is one last error that I am concerned about:

          ```
          [javac] /home/omkar/Documents/GIT/nutch1/src/java/org/apache/nutch/crawl/CrawlDb.java:119: error: unreported exception InterruptedException; must be caught or declared to be thrown
          [javac] int complete = job.waitForCompletion(true)?0:1;
          ```
          I have used `int complete = job.waitForCompletion(true)?0:1;` in multiple files without getting any error there, so I am unable to understand why I am getting it here. Thanks.
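          The error itself is just Java's checked-exception rule: in the new API, Job.waitForCompletion(boolean) declares InterruptedException (among others), so every caller must either catch it or declare it; the files that compile cleanly presumably already declare it (e.g. a run() method with `throws Exception`). A self-contained sketch of the two options, using a local stand-in method rather than the real Hadoop call:

          ```java
          // "waitForCompletion" here is a local stand-in that declares
          // InterruptedException the way Hadoop's Job.waitForCompletion does.
          public class CheckedExceptionDemo {

            static boolean waitForCompletion(boolean verbose) throws InterruptedException {
              return true; // pretend the job succeeded
            }

            // Option 1: declare the exception and let it propagate to the caller.
            static int runDeclared() throws InterruptedException {
              return waitForCompletion(true) ? 0 : 1;
            }

            // Option 2: catch it, restore the interrupt flag, and report failure.
            static int runCaught() {
              try {
                return waitForCompletion(true) ? 0 : 1;
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 1;
              }
            }

            public static void main(String[] args) throws InterruptedException {
              System.out.println(runDeclared()); // prints 0
              System.out.println(runCaught());   // prints 0
            }
          }
          ```

          Either change to CrawlDb.java (adding the exception to the enclosing method's throws clause, or wrapping the call in try/catch) would resolve the compile error.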

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310300795

          @Omkar20895 ... again, you've not pushed anything to the remote branch. Please push so I can pull locally. Thanks.


          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310357695

          @lewismc pardon me for the delay. I thought I pushed a commit. 😅


          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310437374

Check out [Job.waitForCompletion](https://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapreduce/Job.html); it declares:
          ```
          Throws:
          IOException - thrown if the communication with the JobTracker is lost
          InterruptedException
          ClassNotFoundException
          ```
These need to be handled appropriately; please go and have a look at exception handling within method calls, e.g. try...catch.


          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310624270

          @lewismc

> I have used `int complete = job.waitForCompletion(true)?0:1;` in multiple files and I did not get any error in those files and I am unable to understand why I am getting this here

This exception handling needs to be done in every file that uses the above line; I am working on it and will push a commit as soon as I am done. I am making the following change:

          Converting `int complete = job.waitForCompletion(true)?0:1;` to

          ```
try {
  int complete = job.waitForCompletion(true) ? 0 : 1;
} catch (InterruptedException e) {
  LOG.info("Exception: " + e);
  throw e;
} catch (ClassNotFoundException e) {
  LOG.info("Exception: " + e);
  throw e;
}
          ```
          Thanks.


          sebastian-nagel commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310639864

Hi @Omkar20895, blindly catching exceptions (logging and throwing them again) does not make the code better. It's important to think about whether the exception needs to be handled or not. If `job.waitForCompletion(true)` throws an exception, the job has failed:

          • sometimes you need to clean up to avoid that long-living data structures (CrawlDb, LinkDb) are broken, e.g. in [Injector](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Injector.java#L402)
          • for write-once structures (segments) this is usually not a requirement, as broken segments are just ignored by other tools. In this case it's enough to throw the exception (it needs to be declared to be thrown).
            Afaics, you're on the right track. Just make sure before a push that everything compiles (`ant clean runtime javadoc test`).
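The cleanup-then-rethrow pattern for the long-living (CrawlDb/LinkDb) case can be sketched without Hadoop on the classpath; `StubJob` and `CrawlDbStyleRunner.runWithCleanup` below are hypothetical stand-ins for `Job.waitForCompletion` and the real cleanup logic, for illustration only:

```java
import java.io.IOException;

// StubJob stands in for org.apache.hadoop.mapreduce.Job (hypothetical).
class StubJob {
    private final boolean fail;
    StubJob(boolean fail) { this.fail = fail; }
    boolean waitForCompletion(boolean verbose)
            throws IOException, InterruptedException, ClassNotFoundException {
        if (fail) throw new IOException("communication with the JobTracker lost");
        return true;
    }
}

class CrawlDbStyleRunner {
    static boolean cleanedUp = false;

    // Long-living structure: on any failure, clean up first,
    // then rethrow so the caller still sees the original exception.
    static int runWithCleanup(StubJob job) throws Exception {
        try {
            return job.waitForCompletion(true) ? 0 : 1;
        } catch (IOException | InterruptedException | ClassNotFoundException e) {
            cleanedUp = true; // a real tool would delete the temporary output path here
            throw e;
        }
    }
}
```

The multi-catch keeps the cleanup in one place while preserving the original exception type for the caller.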


          sebastian-nagel commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r123725912

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDb.java
          ##########
```
@@ -111,7 +116,7 @@ public void update(Path crawlDb, Path[] segments, boolean normalize,
     LOG.info("CrawlDb update: Merging segment data into db.");
   }
   try {
-    JobClient.runJob(job);
+    int complete = job.waitForCompletion(true)?0:1;
   } catch (IOException e) {
```

          Review comment:
          Need to catch all thrown exceptions, incl. ClassNotFoundException and InterruptedException. Could also just catch `Exception`.


          Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r123726840

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDb.java
          ##########
```
@@ -111,7 +116,7 @@ public void update(Path crawlDb, Path[] segments, boolean normalize,
     LOG.info("CrawlDb update: Merging segment data into db.");
   }
   try {
-    JobClient.runJob(job);
+    int complete = job.waitForCompletion(true)?0:1;
   } catch (IOException e) {
```

          Review comment:
@sebastian-nagel in the above job, when there is an IOException the output path is removed (cleaned up) before the error is rethrown; the same needs to be done when we catch the other two exceptions, i.e. ClassNotFoundException and InterruptedException, doesn't it? Please feel free to correct me if I am wrong.


          sebastian-nagel commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r123728900

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDb.java
          ##########
```
@@ -111,7 +116,7 @@ public void update(Path crawlDb, Path[] segments, boolean normalize,
     LOG.info("CrawlDb update: Merging segment data into db.");
   }
   try {
-    JobClient.runJob(job);
+    int complete = job.waitForCompletion(true)?0:1;
   } catch (IOException e) {
```

          Review comment:
          Yes, exactly, e.g.
          ```
          public void update(...) throws Exception {
          ...
  try {
    job.waitForCompletion(true);
  } catch (IOException | ClassNotFoundException | InterruptedException e) {
    // do cleanup
    throw e;
  }
          ```


          sebastian-nagel commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r123728981

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDb.java
          ##########
```
@@ -111,7 +116,7 @@ public void update(Path crawlDb, Path[] segments, boolean normalize,
     LOG.info("CrawlDb update: Merging segment data into db.");
   }
   try {
-    JobClient.runJob(job);
+    int complete = job.waitForCompletion(true)?0:1;
   } catch (IOException e) {
```

          Review comment:
I left `complete` out because it is not used.


          sebastian-nagel commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r123730458

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDb.java
          ##########
```
@@ -111,7 +116,7 @@ public void update(Path crawlDb, Path[] segments, boolean normalize,
     LOG.info("CrawlDb update: Merging segment data into db.");
   }
   try {
-    JobClient.runJob(job);
+    int complete = job.waitForCompletion(true)?0:1;
   } catch (IOException e) {
```

          Review comment:
          Of course, you could also (maybe a matter of taste):
          ```
          public void update(...) throws IOException, ClassNotFoundException, InterruptedException {
          ```
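Under the same assumption (a stub in place of the real Hadoop job), the declare-instead-of-catch variant for write-once segment tools can be sketched as follows; `SegmentStyleTool` is a hypothetical name, for illustration only:

```java
import java.io.IOException;

// Write-once (segment) case: no cleanup is needed, so the checked exceptions
// are simply declared and propagate to the caller. SegmentStyleTool is hypothetical.
class SegmentStyleTool {
    static int run(boolean succeed)
            throws IOException, InterruptedException, ClassNotFoundException {
        // a real tool would call job.waitForCompletion(true) here; stubbed for illustration
        return succeed ? 0 : 1;
    }
}
```

The caller then decides whether to abort or retry; a broken segment is simply ignored by later tools.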


          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310693000

The next commit will update the plugins and tests, which I am working on. I will update the PR as soon as I am done. Thanks.


          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-310733679

Thank you @sebastian-nagel, excellent advice.


          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-311569394

          How are things coming along @Omkar20895 ?


          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-312704405

          Hi,
          These are the warnings and errors that I am currently getting and have been unable to resolve:

          ```
          compile-core-test:
          [javac] Compiling 54 source files to /Users/omkar/Documents/Git/nutch/build/test/classes
          [javac] /Users/omkar/Documents/Git/nutch/src/test/org/apache/nutch/crawl/CrawlDbUpdateTestDriver.java:95: warning: [deprecation] setConfiguration(Configuration) in TestDriver has been deprecated
          [javac] reduceDriver.setConfiguration(configuration);
          [javac] ^
          [javac] /Users/omkar/Documents/Git/nutch/src/test/org/apache/nutch/crawl/CrawlDbUpdateUtil.java:58: error: cannot find symbol
          [javac] reducer.configure(Job.getInstance(conf));
          [javac] ^
          [javac] symbol: method configure(Job)
          [javac] location: variable reducer of type T
          [javac] where T is a type-variable:
          [javac] T extends Reducer<Text,CrawlDatum,Text,CrawlDatum> declared in class CrawlDbUpdateUtil
          ```
          The reducer we are referring to here is [CrawlDbReducer](https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/crawl/CrawlDbReducer.java), and even though it has a configure method, I am still getting the error.

          ```
          [javac] /Users/omkar/Documents/Git/nutch/src/test/org/apache/nutch/crawl/CrawlDbUpdateUtil.java:62: error: CrawlDbUpdateUtil.DummyContext is not abstract and does not override abstract method write(Object,Object) in TaskInputOutputContext
          [javac] private class DummyContext extends Context {
          [javac] ^
          [javac]
          ```
          Here I have already overridden the write method, but I am still getting this error.

          ```
          /Users/omkar/Documents/Git/nutch/src/test/org/apache/nutch/crawl/CrawlDbUpdateUtil.java:124: error: reduce(KEYIN,Iterable<VALUEIN>,Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>.Context) has protected access in Reducer
          [javac] reducer.reduce(dummyURL, (Iterable)values, context);
          [javac] ^
          [javac] where KEYIN,VALUEIN,KEYOUT,VALUEOUT are type-variables:
          [javac] KEYIN extends Object declared in class Reducer
          [javac] VALUEIN extends Object declared in class Reducer
          [javac] KEYOUT extends Object declared in class Reducer
          [javac] VALUEOUT extends Object declared in class Reducer
          [javac] /Users/omkar/Documents/Git/nutch/src/test/org/apache/nutch/indexer/TestIndexerMapReduce.java:172: warning: [deprecation] setConfiguration(Configuration) in TestDriver has been deprecated
          [javac] reduceDriver.setConfiguration(configuration);
          [javac] ^
          [javac] Note: /Users/omkar/Documents/Git/nutch/src/test/org/apache/nutch/crawl/CrawlDbUpdateUtil.java uses unchecked or unsafe operations.
          [javac] Note: Recompile with -Xlint:unchecked for details.
          [javac] 3 errors
          [javac] 2 warnings

          ```
          Please pull the latest commit and run "ant clean runtime test" to reproduce the errors. The runtime build succeeds, but the tests give errors. Thanks.
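          [Editorial note: the protected-access error on reduce() is inherent to the new API; org.apache.hadoop.mapreduce.Reducer#reduce is protected, so tests usually exercise a reducer through a test driver instead of calling reduce() directly. A minimal sketch of that pattern with MRUnit's mapreduce ReduceDriver follows; the SumReducer is illustrative, not Nutch code.]

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;

public class SumReducerTest {

  // Trivial reducer, used only to illustrate the driver pattern.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    ReduceDriver<Text, IntWritable, Text, IntWritable> driver =
        ReduceDriver.newReduceDriver(new SumReducer());
    // getConfiguration() replaces the deprecated setConfiguration(Configuration):
    driver.getConfiguration().set("example.property", "value");
    driver.withInput(new Text("url"),
            Arrays.asList(new IntWritable(1), new IntWritable(2)))
        .withOutput(new Text("url"), new IntWritable(3))
        .runTest(); // throws AssertionError if output does not match
  }
}
```

          [The driver invokes the protected reduce() internally, so test code never needs access to it.]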


          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-312743093

          @Omkar20895

          regarding ```CrawlDbUpdateTestDriver.java:95: warning: [deprecation] setConfiguration(Configuration) in TestDriver has been deprecated```:
          please see the [javadoc](https://mrunit.apache.org/documentation/javadocs/1.1.0/index.html?org/apache/hadoop/mrunit/TestDriver.html): "Deprecated. Use getConfiguration() to set configuration items as opposed to overriding the entire configuration object as it's used internally." Also make sure to update the class- and method-level API documentation to ```{@link CrawlDbReducer#reduce(Text, Iterator, Context)}```; this is present in several classes and you have not updated it yet.

          regarding ```CrawlDbUpdateUtil.java:58: error: cannot find symbol```: this is due to the removal of the ```configure``` method from Reducer.java. Please see the [Javadoc](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Reducer.html). You may have to work with ```setup``` instead.

          Fix both of the above; once addressed, I think that will bring the compilation back to a stable state. Also check out the Javadoc for [TaskInputOutputContext](https://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapreduce/TaskInputOutputContext.html) and ensure that CrawlDbUpdateUtil.DummyContext overrides the appropriate methods with the appropriate parameters.
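          [Editorial note: the configure-to-setup migration suggested above could look like the following minimal sketch; the class name and the property key are illustrative, not taken from the Nutch code base.]

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// In the old org.apache.hadoop.mapred API, per-task initialization lived in
// configure(JobConf). In org.apache.hadoop.mapreduce there is no configure();
// the equivalent hook is setup(Context), called once before any reduce() call.
public class ExampleReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private int threshold; // illustrative state initialized from the job configuration

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    threshold = conf.getInt("example.threshold", 0); // hypothetical property
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    // Only emit keys whose summed value reaches the configured threshold.
    if (sum >= threshold) {
      context.write(key, new IntWritable(sum));
    }
  }
}
```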


          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-313889464

          Thank you for updating the PR, @Omkar20895. Please begin making your way through the tests and fixing the errors. AFAICT, the following tests currently fail; you can see all of the reports in ```build/tests```

          ```
          [junit] Running org.apache.nutch.crawl.TestCrawlDbFilter
          [junit] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 2.099 sec
          [junit] Test org.apache.nutch.crawl.TestCrawlDbFilter FAILED
          [junit] Running org.apache.nutch.crawl.TestCrawlDbMerger
          [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.986 sec
          [junit] Test org.apache.nutch.crawl.TestCrawlDbMerger FAILED
          [junit] Running org.apache.nutch.crawl.TestGenerator
          [junit] Tests run: 4, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 8.483 sec
          [junit] Test org.apache.nutch.crawl.TestGenerator FAILED
          [junit] Running org.apache.nutch.crawl.TestLinkDbMerger
          [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.448 sec
          [junit] Test org.apache.nutch.crawl.TestLinkDbMerger FAILED
          [junit] Running org.apache.nutch.fetcher.TestFetcher
          [junit] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.323 sec
          [junit] Test org.apache.nutch.fetcher.TestFetcher FAILED
          [junit] Running org.apache.nutch.indexer.TestIndexerMapReduce
          [junit] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.766 sec
          [junit] Test org.apache.nutch.indexer.TestIndexerMapReduce FAILED
          [junit] Running org.apache.nutch.plugin.TestPluginSystem
          [junit] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.073 sec
          [junit] Test org.apache.nutch.plugin.TestPluginSystem FAILED
          [junit] Running org.apache.nutch.segment.TestSegmentMerger
          [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 23.399 sec
          [junit] Test org.apache.nutch.segment.TestSegmentMerger FAILED
          [junit] Running org.apache.nutch.segment.TestSegmentMergerCrawlDatums
          [junit] Tests run: 7, Failures: 7, Errors: 0, Skipped: 0, Time elapsed: 32.267 sec
          [junit] Test org.apache.nutch.segment.TestSegmentMergerCrawlDatums FAILED
          ```

          @Omkar20895 have you tried to execute an end-to-end test crawl and validate the results? If not, then start there and see where it is all going wrong.
          Thanks


          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r126293237

          ##########
          File path: src/java/org/apache/nutch/segment/SegmentMerger.java
          ##########
```
@@ -384,40 +385,43 @@ public void setConf(Configuration conf) {
   public void close() throws IOException {
   }

-  public void configure(JobConf conf) {
+  public void configure(Job job) {
+    Configuration conf = job.getConfiguration();
     setConf(conf);
     if (sliceSize > 0) {
-      sliceSize = sliceSize / conf.getNumReduceTasks();
+      sliceSize = sliceSize / Integer.parseInt(conf.get("mapreduce.map.tasks"));
```

          Review comment:
          This does not look correct at all... we should be obtaining the slice size by dividing an integer whose value is greater than zero by the configured number of reduce tasks, not by the configured number of map tasks.


          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r126293242

          ##########
          File path: src/java/org/apache/nutch/segment/SegmentMerger.java
          ##########
```
@@ -384,40 +385,43 @@ public void setConf(Configuration conf) {
   public void close() throws IOException {
   }

-  public void configure(JobConf conf) {
+  public void configure(Job job) {
+    Configuration conf = job.getConfiguration();
     setConf(conf);
     if (sliceSize > 0) {
-      sliceSize = sliceSize / conf.getNumReduceTasks();
+      sliceSize = sliceSize / Integer.parseInt(conf.get("mapreduce.map.tasks"));
     }
   }

-  private Text newKey = new Text();
-  public void map(Text key, MetaWrapper value,
-      OutputCollector<Text, MetaWrapper> output, Reporter reporter)
-      throws IOException {
-    String url = key.toString();
-    if (normalizers != null) {
-      try {
-        url = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT); // normalize
-        // the
-        // url
-      } catch (Exception e) {
-        LOG.warn("Skipping " + url + ":" + e.getMessage());
-        url = null;
+  public static class SegmentMergerMapper extends
+      Mapper<Text, MetaWrapper, Text, MetaWrapper> {
+    public void map(Text key, MetaWrapper value,
+        Context context) throws IOException, InterruptedException {
+      Text newKey = new Text();
+      String url = key.toString();
+      if (normalizers != null) {
+        try {
+          url = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT); // normalize
```
          Review comment:
          Sort the comment out, make sure everything is on the same line.


          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-313890314

          @jnioche @sebastian-nagel @chrismattmann @jorgelbg and rest of team, would it be preferred for all Mapper and Reducer classes to be extracted from the individual Jobs which ```extends Configured implements Tool``` into a ```mapper``` and a ```reducer``` package respectively, or is it fine to retain them as nested classes within the Job they belong to? Thanks for any input.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          Hide
          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-313890439

          @Omkar20895 start with ```TestSegmentMergerCrawlDatums.java```, the primary issue is as follows
          ```
          2017-07-08 17:27:43,515 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(560)) - job_local1442546213_0007
          java.lang.Exception: java.lang.ClassCastException: org.apache.nutch.crawl.CrawlDatum cannot be cast to org.apache.nutch.metadata.MetaWrapper
          at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
          at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
          Caused by: java.lang.ClassCastException: org.apache.nutch.crawl.CrawlDatum cannot be cast to org.apache.nutch.metadata.MetaWrapper
          at org.apache.nutch.segment.SegmentMerger$SegmentMergerMapper.map(SegmentMerger.java:397)
          at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
          at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
          at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:745)
          ```
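          The failure mode in the trace above — a value of one runtime type reaching a `map()` that expects another — can be reproduced in miniature with plain Java. The classes below are hypothetical stand-ins, not the real Nutch `CrawlDatum`/`MetaWrapper`; the point is only that generic erasure defers the type check to the cast at the call site:

          ```java
          // Minimal sketch of the ClassCastException above, with stand-in classes.
          public class CastFailureSketch {
              static class CrawlDatumLike {}   // stands in for org.apache.nutch.crawl.CrawlDatum
              static class MetaWrapperLike {}  // stands in for org.apache.nutch.metadata.MetaWrapper

              // Performs the cast a Mapper<.., MetaWrapper, ..> does implicitly on its input value.
              public static String tryCast(Object record) {
                  try {
                      MetaWrapperLike w = (MetaWrapperLike) record; // throws if record is the raw type
                      return "ok: " + w.getClass().getSimpleName();
                  } catch (ClassCastException e) {
                      return "ClassCastException";
                  }
              }

              public static void main(String[] args) {
                  // The record reader emitted the raw datum type, but map() expects the wrapper:
                  System.out.println(tryCast(new CrawlDatumLike()));  // ClassCastException
                  System.out.println(tryCast(new MetaWrapperLike())); // ok: MetaWrapperLike
              }
          }
          ```

          The usual remedy is to make the job's input format and the mapper's declared value class agree, so the framework hands the mapper the type it was parameterized with.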

          Hide
          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r126295341

          ##########
          File path: src/java/org/apache/nutch/segment/SegmentMerger.java
          ##########
          ```
          @@ -384,40 +385,43 @@ public void setConf(Configuration conf) {
             public void close() throws IOException {
             }

          -  public void configure(JobConf conf) {
          +  public void configure(Job job) {
          +    Configuration conf = job.getConfiguration();
               setConf(conf);
               if (sliceSize > 0) {
          -      sliceSize = sliceSize / conf.getNumReduceTasks();
          +      sliceSize = sliceSize / Integer.parseInt(conf.get("mapreduce.map.tasks"));
          ```

          Review comment:
          @lewismc pardon me, my bad! I will make this change and commit it in the next update.
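          Two hazards in the replaced line are worth noting: the new code reads `"mapreduce.map.tasks"` where the old code asked for the number of *reduce* tasks, and `Integer.parseInt(conf.get(...))` throws if the key is unset (as it can be under `LocalJobRunner`). A defensive read with a default — the behavior of Hadoop's `Configuration.getInt(name, defaultValue)` — avoids the crash. This is a stdlib-only sketch over a plain `Map`, not actual Hadoop code; `"mapreduce.job.reduces"` is the new-API property for the reduce-task count:

          ```java
          import java.util.HashMap;
          import java.util.Map;

          public class ConfigReadSketch {
              // Mirrors Configuration.getInt(name, defaultValue) semantics on a plain Map:
              // missing or unparseable values fall back to the default instead of throwing.
              public static int getInt(Map<String, String> conf, String key, int defaultValue) {
                  String raw = conf.get(key);
                  if (raw == null) return defaultValue;
                  try {
                      return Integer.parseInt(raw.trim());
                  } catch (NumberFormatException e) {
                      return defaultValue;
                  }
              }

              public static void main(String[] args) {
                  Map<String, String> conf = new HashMap<>();
                  System.out.println(getInt(conf, "mapreduce.job.reduces", 1)); // unset -> 1
                  conf.put("mapreduce.job.reduces", "4");
                  System.out.println(getInt(conf, "mapreduce.job.reduces", 1)); // set -> 4
              }
          }
          ```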

          Hide
          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r126473415

          ##########
          File path: src/java/org/apache/nutch/segment/SegmentMerger.java
          ##########
          ```
          @@ -384,40 +385,43 @@ public void setConf(Configuration conf) {
             public void close() throws IOException {
             }

          -  public void configure(JobConf conf) {
          +  public void configure(Job job) {
          +    Configuration conf = job.getConfiguration();
               setConf(conf);
               if (sliceSize > 0) {
          -      sliceSize = sliceSize / conf.getNumReduceTasks();
          +      sliceSize = sliceSize / Integer.parseInt(conf.get("mapreduce.map.tasks"));
               }
             }

          -  private Text newKey = new Text();
          -  public void map(Text key, MetaWrapper value,
          -      OutputCollector<Text, MetaWrapper> output, Reporter reporter)
          -      throws IOException {
          -    String url = key.toString();
          -    if (normalizers != null) {
          -      try {
          -        url = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT); // normalize
          -        // the
          -        // url
          -      } catch (Exception e) {
          -        LOG.warn("Skipping " + url + ":" + e.getMessage());
          -        url = null;
          +  public static class SegmentMergerMapper extends
          +      Mapper<Text, MetaWrapper, Text, MetaWrapper> {
          +    public void map(Text key, MetaWrapper value,
          +        Context context) throws IOException, InterruptedException {
          +      Text newKey = new Text();
          +      String url = key.toString();
          +      if (normalizers != null) {
          +        try {
          +          url = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT); // normalize
          ```

          Review comment:
          Done!

          Hide
          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-314161600

          @lewismc working on it! Thanks.

          Hide
          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-315180650

          @lewismc I think we need to replace the configure() method with setup(). What would you suggest? Thanks for any suggestion.
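          The reason `setup()` replaces `configure()` is the new-API Mapper lifecycle: `run()` calls `setup()` exactly once before the record loop. The following is a hypothetical, stdlib-only model of that lifecycle for illustration — these are not the real `org.apache.hadoop.mapreduce` classes:

          ```java
          import java.util.List;

          public class LifecycleSketch {
              static class Context {                    // stand-in for Mapper.Context
                  final List<String> records;
                  final StringBuilder log = new StringBuilder();
                  Context(List<String> records) { this.records = records; }
              }

              static class MapperModel {
                  protected void setup(Context ctx) { } // replaces old-API configure(JobConf)
                  protected void map(String record, Context ctx) { }
                  public void run(Context ctx) {        // mirrors new-API Mapper.run()
                      setup(ctx);                       // invoked exactly once per task
                      for (String r : ctx.records) map(r, ctx);
                  }
              }

              public static String demo() {
                  Context ctx = new Context(List.of("a", "b"));
                  new MapperModel() {
                      @Override protected void setup(Context c) { c.log.append("setup;"); }
                      @Override protected void map(String r, Context c) { c.log.append("map:" + r + ";"); }
                  }.run(ctx);
                  return ctx.log.toString();
              }

              public static void main(String[] args) {
                  System.out.println(demo()); // setup;map:a;map:b;
              }
          }
          ```

          One-time initialization that previously lived in `configure(JobConf)` thus moves into an overridden `setup(Context)`, with the `Configuration` obtained via `context.getConfiguration()`.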

          Hide
          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-315181534

          @Omkar20895 yes

          Hide
          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-315181823

          @lewismc Thanks, I will start working on it.

          Hide
          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-317454776

          I am working on stabilizing the tests for CrawlDbMerger and will commit it once I am done. Thanks.
          @lewismc

          Hide
          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-317547121

          OK good @Omkar20895 let me know by usual communication channels if you are struggling at all. Thanks.

          Hide
          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-317593766

          I'm still seeing the following failures in test
          ```
          [junit] Running org.apache.nutch.crawl.TestCrawlDbMerger
          [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.095 sec
          [junit] Test org.apache.nutch.crawl.TestCrawlDbMerger FAILED
          [junit] Running org.apache.nutch.crawl.TestGenerator
          [junit] Tests run: 4, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 9.021 sec
          [junit] Test org.apache.nutch.crawl.TestGenerator FAILED
          [junit] Running org.apache.nutch.crawl.TestLinkDbMerger
          [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.524 sec
          [junit] Test org.apache.nutch.crawl.TestLinkDbMerger FAILED
          [junit] Running org.apache.nutch.fetcher.TestFetcher
          [junit] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.788 sec
          [junit] Test org.apache.nutch.fetcher.TestFetcher FAILED
          [junit] Running org.apache.nutch.indexer.TestIndexerMapReduce
          [junit] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.002 sec
          [junit] Test org.apache.nutch.indexer.TestIndexerMapReduce FAILED
          [junit] Running org.apache.nutch.plugin.TestPluginSystem
          [junit] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.016 sec
          [junit] Test org.apache.nutch.plugin.TestPluginSystem FAILED
          [junit] Running org.apache.nutch.segment.TestSegmentMerger
          [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 24.588 sec
          [junit] Test org.apache.nutch.segment.TestSegmentMerger FAILED
          [junit] Running org.apache.nutch.segment.TestSegmentMergerCrawlDatums
          [junit] Tests run: 7, Failures: 7, Errors: 0, Skipped: 0, Time elapsed: 35.151 sec
          [junit] Test org.apache.nutch.segment.TestSegmentMergerCrawlDatums FAILED
          ```
          I'm sure you are aware of these. Let's hash them out offline.

          Hide
          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-319555964

          Hi @Omkar20895, please look at the following document for ideas on how comparisons in Java + JUnit should be done and what the impact of each method is.
          https://stackoverflow.com/questions/27605714/test-two-instances-of-object-are-equal-junit
          Please write back here or else on GChat if you are still struggling. We need to move on and begin coding the new GraphGenerator tool.
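          The pitfall behind that link is that two logically equal instances are `==`-distinct, so tests must compare with `equals()` (which is what JUnit's `assertEquals` does), and the value class must override `equals()`/`hashCode()`. A stdlib-only sketch with a hypothetical datum class (not the real `CrawlDatum`):

          ```java
          import java.util.Objects;

          public class EqualitySketch {
              static final class Datum {
                  final int status;
                  final long fetchTime;
                  Datum(int status, long fetchTime) { this.status = status; this.fetchTime = fetchTime; }

                  // Value equality: without this override, equals() falls back to
                  // reference identity and assertEquals on two fresh instances fails.
                  @Override public boolean equals(Object o) {
                      if (this == o) return true;
                      if (!(o instanceof Datum)) return false;
                      Datum d = (Datum) o;
                      return status == d.status && fetchTime == d.fetchTime;
                  }
                  @Override public int hashCode() { return Objects.hash(status, fetchTime); }
              }

              public static void main(String[] args) {
                  Datum a = new Datum(1, 1000L);
                  Datum b = new Datum(1, 1000L);
                  System.out.println(a == b);      // false: different references
                  System.out.println(a.equals(b)); // true: same field values
              }
          }
          ```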

          Hide
          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-320462735

          A fix for the rest of the tests is on the way. Thanks.

          Hide
          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r132838270

          ##########
          File path: src/java/org/apache/nutch/segment/SegmentMerger.java
          ##########
          ```
          @@ -174,47 +174,51 @@
                 // ignore
               }
             }

          -  final SequenceFileRecordReader<Text, Writable> splitReader = new SequenceFileRecordReader<>(
          -      job, (FileSplit) split);
          +  final SequenceFileRecordReader<Text, Writable> splitReader = new SequenceFileRecordReader<>();
             try {
          -    return new SequenceFileRecordReader<Text, MetaWrapper>(job, fSplit) {
          +    return new SequenceFileRecordReader<Text, MetaWrapper>() {
          +
          +      public MetaWrapper wrapper;
          -      public synchronized boolean next(Text key, MetaWrapper wrapper)
          -          throws IOException {
          -        LOG.debug("Running OIF.next()");
          +      @Override
          +      public synchronized boolean nextKeyValue()
          +          throws IOException, InterruptedException {
          +        try {
          +          LOG.debug("Running OIF.nextKeyValue()");
          -        boolean res = splitReader.next(key, w);
          +          splitReader.initialize(split, context);
          +          this.initialize(split, context);
          +          boolean res = splitReader.nextKeyValue();
          +          wrapper = this.getCurrentValue();
          ```

          Review comment:
          Hi @lewismc, I think TestSegmentMerger is failing here with a NullPointerException because the SequenceFileRecordReader is not being initialized as expected. Please correct me if I am wrong. Thanks.
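          That suspicion matches a common pattern in the new API: a `SequenceFileRecordReader` built with the no-arg constructor holds no internal state until `initialize(split, context)` runs, so calling `nextKeyValue()` first dereferences null. A stdlib-only model of that initialize-before-use contract — hypothetical classes, not Hadoop's:

          ```java
          import java.util.Iterator;
          import java.util.List;

          public class ReaderInitSketch {
              static class RecordReaderModel {
                  private Iterator<String> it;                    // null until initialize() runs
                  void initialize(List<String> split) { this.it = split.iterator(); }
                  boolean nextKeyValue() { return it.hasNext(); } // NPE if never initialized
              }

              public static String probe(boolean initialized) {
                  RecordReaderModel reader = new RecordReaderModel();
                  if (initialized) reader.initialize(List.of("rec1"));
                  try {
                      return "hasNext=" + reader.nextKeyValue();
                  } catch (NullPointerException e) {
                      return "NullPointerException";
                  }
              }

              public static void main(String[] args) {
                  System.out.println(probe(false)); // NullPointerException
                  System.out.println(probe(true));  // hasNext=true
              }
          }
          ```

          In the real API the framework calls `initialize()` on the reader it created itself; a reader constructed manually inside another reader must be initialized explicitly before its first `nextKeyValue()`.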

          Hide
          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#discussion_r134567680

          ##########
          File path: src/java/org/apache/nutch/fetcher/Fetcher.java
          ##########
          ```diff
          @@ -93,39 +103,23 @@
             public static class InputFormat extends
                 SequenceFileInputFormat<Text, CrawlDatum> {
               /** Don't split inputs, to keep things polite. */

          -    public InputSplit[] getSplits(JobConf job, int nSplits) throws IOException {
          -      FileStatus[] files = listStatus(job);
          -      FileSplit[] splits = new FileSplit[files.length];
          -      for (int i = 0; i < files.length; i++) {
          -        FileStatus cur = files[i];
          -        splits[i] = new FileSplit(cur.getPath(), 0, cur.getLen(),
          +    public InputSplit[] getSplits(JobContext job, int nSplits) throws IOException {
          +      Configuration conf = job.getConfiguration();
          +      List<FileStatus> files = listStatus(job);
          ```

          Review comment:
          @lewismc These changes were made because in the old API listStatus(job) returned a FileStatus array, whereas the new API returns a List<FileStatus>. The code was changed accordingly.
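
          The return-type change means the split-building loop iterates a `List<FileStatus>` instead of indexing an array. A plain-Java sketch of the same "one unsplit split per file" pattern, with `FileInfo` and `Split` standing in for Hadoop's `FileStatus` and `FileSplit` (both stand-ins are invented for illustration):

          ```java
          import java.util.ArrayList;
          import java.util.List;

          public class OneSplitPerFileDemo {
              // Illustrative stand-ins for FileStatus and FileSplit.
              record FileInfo(String path, long len) {}
              record Split(String path, long start, long length) {}

              // Mirrors Fetcher.InputFormat: never split a file, one split per input,
              // each split covering the whole file from offset 0.
              static List<Split> getSplits(List<FileInfo> files) {
                  List<Split> splits = new ArrayList<>();
                  for (FileInfo f : files) {
                      splits.add(new Split(f.path(), 0, f.len()));
                  }
                  return splits;
              }

              public static void main(String[] args) {
                  List<FileInfo> files = List.of(
                      new FileInfo("crawl_generate/part-r-00000", 1024),
                      new FileInfo("crawl_generate/part-r-00001", 2048));
                  System.out.println(getSplits(files).size()); // 2
              }
          }
          ```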

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-327110295

          Hi, I believe all the changes in this PR are stable now. Please feel free to review it; any suggestions would be welcome. Thanks.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-327219557

          @Omkar20895 please squash your commits and we will undertake community peer review thank you.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-327368291

          @Omkar20895 ping

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-327400799

          @lewismc Please find above the squashed commit. Thanks.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188#issuecomment-327608706

          Hi @Omkar20895 you've not squashed the commits... if you had squashed them it would say

          > Omkar20895 wants to merge 1 commits into apache:master from Omkar20895:NUTCH-2375

          not

          > Omkar20895 wants to merge 46 commits into apache:master from Omkar20895:NUTCH-2375

          Can you please try squashing again? Thanks
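
          Squashing is usually done interactively with `git rebase -i`; the sketch below shows the equivalent non-interactive route with `git reset --soft` in a throwaway repository (the repo, file names, and messages are all illustrative — on the PR branch the reset target would be the merge base with apache/master, e.g. `git reset --soft $(git merge-base apache/master HEAD)` followed by one commit):

          ```shell
          # Demonstrate collapsing several commits into one with git reset --soft.
          set -e
          repo=$(mktemp -d)
          cd "$repo"
          git init -q
          git config user.email dev@example.com
          git config user.name dev

          for i in 1 2 3; do
            echo "change $i" >> file.txt
            git add file.txt
            git commit -qm "commit $i"
          done
          echo "before: $(git rev-list --count HEAD) commits"   # 3

          # Soft-reset to the root commit: HEAD moves back, but the index still
          # holds the final tree, so one amended commit carries all the work.
          git reset --soft "$(git rev-list --max-parents=0 HEAD)"
          git commit -q --amend -m "NUTCH-2375 squashed work"
          echo "after: $(git rev-list --count HEAD) commit"     # 1
          ```

          After rewriting history like this, the branch has to be force-pushed (`git push -f`) for the PR to show a single commit.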

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 closed pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/188

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 opened a new pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221

          This code has been tested locally by me. Some of the discussions regarding this code can be found in pull request #188 (https://github.com/apache/nutch/pull/188).

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138008013

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDbReader.java
          ##########
          ```diff
          @@ -368,41 +367,46 @@ public void close() {
               closeReaders();
             }

          -  private TreeMap<String, LongWritable> processStatJobHelper(String crawlDb, Configuration config, boolean sort) throws IOException {
          +  private TreeMap<String, LongWritable> processStatJobHelper(String crawlDb, Configuration config, boolean sort)
          +      throws IOException, InterruptedException, ClassNotFoundException {
               Path tmpFolder = new Path(crawlDb, "stat_tmp" + System.currentTimeMillis());
          -    JobConf job = new NutchJob(config);
          +    Job job = NutchJob.getInstance(config);
          +    config = job.getConfiguration();
               job.setJobName("stats " + crawlDb);
          -    job.setBoolean("db.reader.stats.sort", sort);
          +    config.setBoolean("db.reader.stats.sort", sort);

               FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));

          -    job.setInputFormat(SequenceFileInputFormat.class);
          +    job.setInputFormatClass(SequenceFileInputFormat.class);

               job.setMapperClass(CrawlDbStatMapper.class);
               job.setCombinerClass(CrawlDbStatCombiner.class);
               job.setReducerClass(CrawlDbStatReducer.class);

               FileOutputFormat.setOutputPath(job, tmpFolder);

          -    job.setOutputFormat(SequenceFileOutputFormat.class);
          +    job.setOutputFormatClass(SequenceFileOutputFormat.class);
               job.setOutputKeyClass(Text.class);
               job.setOutputValueClass(LongWritable.class);

               // https://issues.apache.org/jira/browse/NUTCH-1029
          -    job.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
          -
          -    JobClient.runJob(job);
          +    config.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);

          +    try {
          +      int complete = job.waitForCompletion(true) ? 0 : 1;
          +    } catch (InterruptedException | ClassNotFoundException e) {
          +      LOG.error(StringUtils.stringifyException(e));
          +      throw e;
          +    }

               // reading the result
               FileSystem fileSystem = tmpFolder.getFileSystem(config);
          -    SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(config,
          -        tmpFolder);
          +    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(tmpFolder, config);
          ```

          Review comment:
          ... a SequenceFile.Reader is required to read it. Otherwise reading fails with
          ```
          Exception in thread "main" java.io.FileNotFoundException: File file:.../stat_tmp1505114561240/part-r-00000/data does not exist
          at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
          at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
          at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
          at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
          at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1820)
          at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:456)
          at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:429)
          at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)
          at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:408)
          at org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat.getReaders(MapFileOutputFormat.java:98)
          at org.apache.nutch.crawl.CrawlDbReader.processStatJobHelper(CrawlDbReader.java:402)
          at org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:444)
          at org.apache.nutch.crawl.CrawlDbReader.run(CrawlDbReader.java:740)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
          at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:792)
          ```
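
          The mismatch is structural: a MapFile is a directory containing `data` and `index` files, whereas SequenceFileOutputFormat writes each `part-r-NNNNN` as a single flat file, so `MapFileOutputFormat.getReaders` goes looking for `part-r-00000/data` and fails as in the stack trace above. A plain-`java.io` sketch (no Hadoop involved; the temp-directory name is illustrative) of the two path shapes:

          ```java
          import java.io.File;
          import java.io.IOException;

          public class OutputShapeDemo {
              public static void main(String[] args) throws IOException {
                  File outDir = new File(System.getProperty("java.io.tmpdir"), "stat_tmp_demo");
                  outDir.mkdirs();

                  // SequenceFileOutputFormat produces one flat file per reducer:
                  File seqPart = new File(outDir, "part-r-00000");
                  seqPart.createNewFile();

                  // MapFileOutputFormat.getReaders() expects part-r-00000 to be a
                  // directory holding "data" and "index" -- which does not exist here.
                  File expectedData = new File(seqPart, "data");
                  System.out.println("flat part file exists: " + seqPart.isFile());
                  System.out.println("MapFile data file exists: " + expectedData.exists());

                  seqPart.delete();
                  outDir.delete();
              }
          }
          ```

          So either the job should write MapFiles (MapFileOutputFormat on the output side) or the read side should stay with SequenceFile readers; mixing the two produces the FileNotFoundException above.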

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138018731

          ##########
          File path: src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java
          ##########
          ```diff
          @@ -29,73 +29,84 @@
           import org.apache.hadoop.io.Writable;
           import org.apache.hadoop.io.Text;
           import org.apache.hadoop.io.SequenceFile.CompressionType;
          -import org.apache.hadoop.mapred.FileOutputFormat;
          +import org.apache.hadoop.util.Progressable;
          +import org.apache.hadoop.conf.Configuration;
          +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
           import org.apache.hadoop.mapred.InvalidJobConfException;
          -import org.apache.hadoop.mapred.OutputFormat;
          -import org.apache.hadoop.mapred.RecordWriter;
          -import org.apache.hadoop.mapred.JobConf;
          -import org.apache.hadoop.mapred.Reporter;
          -import org.apache.hadoop.mapred.SequenceFileOutputFormat;
          +import org.apache.hadoop.mapreduce.OutputFormat;
          +import org.apache.hadoop.mapreduce.RecordWriter;
          +import org.apache.hadoop.mapreduce.Job;
          +import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
          +import org.apache.hadoop.mapreduce.TaskAttemptContext;
          +import org.apache.hadoop.mapreduce.JobContext;
          +import org.apache.hadoop.mapreduce.InputSplit;
          +import org.apache.hadoop.mapred.FileSplit;
           import org.apache.hadoop.util.Progressable;
           import org.apache.nutch.parse.Parse;
           import org.apache.nutch.parse.ParseOutputFormat;
           import org.apache.nutch.protocol.Content;

           /** Splits FetcherOutput entries into multiple map files. */
          -public class FetcherOutputFormat implements OutputFormat<Text, NutchWritable> {
          +public class FetcherOutputFormat extends FileOutputFormat<Text, NutchWritable> {

          -  public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
          +  @Override
          +  public void checkOutputSpecs(JobContext job) throws IOException {
          +    Configuration conf = job.getConfiguration();
          +    FileSystem fs = FileSystem.get(conf);
               Path out = FileOutputFormat.getOutputPath(job);
               if ((out == null) && (job.getNumReduceTasks() != 0)) {
          -      throw new InvalidJobConfException("Output directory not set in JobConf.");
          +      throw new InvalidJobConfException("Output directory not set in conf.");
               }
               if (fs == null) {
          -      fs = out.getFileSystem(job);
          +      fs = out.getFileSystem(conf);
               }
               if (fs.exists(new Path(out, CrawlDatum.FETCH_DIR_NAME)))
                 throw new IOException("Segment already fetched!");
             }

          -  public RecordWriter<Text, NutchWritable> getRecordWriter(final FileSystem fs,
          -      final JobConf job, final String name, final Progressable progress)
          +  @Override
          +  public RecordWriter<Text, NutchWritable> getRecordWriter(TaskAttemptContext context)
                 throws IOException {
          -    Path out = FileOutputFormat.getOutputPath(job);
          +    Configuration conf = context.getConfiguration();
          +    String name = context.getJobName(); // getTaskAttemptID().toString();
          +    Path dir = FileOutputFormat.getOutputPath(context);
          +    FileSystem fs = dir.getFileSystem(context.getConfiguration());
          +    Path out = FileOutputFormat.getOutputPath(context);
          ```

          Review comment:
          This will change the output folder structure and probably will cause collisions of output folders if run in distributed mode (on a Hadoop cluster). The directory tree of a segment should look as before:
          ```
          crawl/segments/20170816093452/
          |-- content
          |   `-- part-00000
          |       |-- data
          |       `-- index
          |-- crawl_fetch
          |   `-- part-00000
          |       |-- data
          |       `-- index
          |-- crawl_generate
          |   `-- part-00000
          |-- crawl_parse
          |   `-- part-00000
          |-- parse_data
          |   `-- part-00000
          |       |-- data
          |       `-- index
          `-- parse_text
              `-- part-00000
                  |-- data
                  `-- index
          ```

          There will be changes due to the MapReduce upgrade (part-xxxxx -> part-r-xxxxx). The tree is now
          ```
          crawl/segments/20170911103223/
          |-- content
          |   `-- FetchData
          |       |-- data
          |       `-- index
          |-- crawl_fetch
          |   `-- FetchData
          |       |-- data
          |       `-- index
          |-- crawl_generate
          |   `-- part-r-00000
          |-- crawl_parse
          |   `-- parse\ crawl
          |       `-- segments
          |           `-- 20170911103223
          |-- parse_data
          |   `-- parse\ crawl
          |       `-- segments
          |           `-- 20170911103223
          |               |-- data
          |               `-- index
          `-- parse_text
              `-- parse\ crawl
                  `-- segments
                      `-- 20170911103223
                          |-- data
                          `-- index
          ```

          which makes a crawl fail, e.g. with
          ```
          CrawlDb update: java.io.FileNotFoundException: File file:.../crawl/segments/20170911103223/crawl_parse/parse crawl/data does not exist
          at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
          at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
          at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
          at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
          ```
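
          One way to preserve the old per-partition layout is to derive the output file name from the task's partition number (what the mapred API passed in as the `name` argument) rather than from `context.getJobName()`, which is identical for every task of the job and may contain path separators. A plain-Java stand-in (the `TaskContext` record is invented for illustration; the real code would obtain the partition from the task attempt ID on `TaskAttemptContext`) showing why job-name-derived paths collide:

          ```java
          import java.util.HashSet;
          import java.util.Set;

          public class OutputNameDemo {
              // Illustrative stand-in for the bits of TaskAttemptContext used here.
              record TaskContext(String jobName, int partition) {
                  // Old mapred behaviour: one uniquely numbered file per partition.
                  String partitionName() {
                      return String.format("part-%05d", partition);
                  }
              }

              public static void main(String[] args) {
                  TaskContext t0 = new TaskContext("parse crawl/segments/20170911103223", 0);
                  TaskContext t1 = new TaskContext("parse crawl/segments/20170911103223", 1);

                  Set<String> byJobName = new HashSet<>();
                  byJobName.add(t0.jobName());
                  byJobName.add(t1.jobName()); // collides: same name for both tasks

                  Set<String> byPartition = new HashSet<>();
                  byPartition.add(t0.partitionName());
                  byPartition.add(t1.partitionName());

                  System.out.println("distinct job-name paths: " + byJobName.size());    // 1
                  System.out.println("distinct partition paths: " + byPartition.size()); // 2
              }
          }
          ```

          The job name also explains the `parse\ crawl/segments/...` directories in the tree above: a job name with spaces and slashes becomes a nested path when used as a file name.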

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          Show
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138018731

          ##########
          File path: src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java
          ##########
          @@ -29,73 +29,84 @@

           import org.apache.hadoop.io.Writable;
           import org.apache.hadoop.io.Text;
           import org.apache.hadoop.io.SequenceFile.CompressionType;
          -import org.apache.hadoop.mapred.FileOutputFormat;
          +import org.apache.hadoop.util.Progressable;
          +import org.apache.hadoop.conf.Configuration;
          +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
           import org.apache.hadoop.mapred.InvalidJobConfException;
          -import org.apache.hadoop.mapred.OutputFormat;
          -import org.apache.hadoop.mapred.RecordWriter;
          -import org.apache.hadoop.mapred.JobConf;
          -import org.apache.hadoop.mapred.Reporter;
          -import org.apache.hadoop.mapred.SequenceFileOutputFormat;
          +import org.apache.hadoop.mapreduce.OutputFormat;
          +import org.apache.hadoop.mapreduce.RecordWriter;
          +import org.apache.hadoop.mapreduce.Job;
          +import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
          +import org.apache.hadoop.mapreduce.TaskAttemptContext;
          +import org.apache.hadoop.mapreduce.JobContext;
          +import org.apache.hadoop.mapreduce.InputSplit;
          +import org.apache.hadoop.mapred.FileSplit;
           import org.apache.hadoop.util.Progressable;
           import org.apache.nutch.parse.Parse;
           import org.apache.nutch.parse.ParseOutputFormat;
           import org.apache.nutch.protocol.Content;

           /** Splits FetcherOutput entries into multiple map files. */
          -public class FetcherOutputFormat implements OutputFormat<Text, NutchWritable> {
          +public class FetcherOutputFormat extends FileOutputFormat<Text, NutchWritable> {

          -  public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
          +  @Override
          +  public void checkOutputSpecs(JobContext job) throws IOException {
          +    Configuration conf = job.getConfiguration();
          +    FileSystem fs = FileSystem.get(conf);
               Path out = FileOutputFormat.getOutputPath(job);
               if ((out == null) && (job.getNumReduceTasks() != 0)) {
          -      throw new InvalidJobConfException("Output directory not set in JobConf.");
          +      throw new InvalidJobConfException("Output directory not set in conf.");
               }
               if (fs == null) {
          -      fs = out.getFileSystem(job);
          +      fs = out.getFileSystem(conf);
               }
               if (fs.exists(new Path(out, CrawlDatum.FETCH_DIR_NAME)))
                 throw new IOException("Segment already fetched!");
             }

          -  public RecordWriter<Text, NutchWritable> getRecordWriter(final FileSystem fs,
          -      final JobConf job, final String name, final Progressable progress)
          +  @Override
          +  public RecordWriter<Text, NutchWritable> getRecordWriter(TaskAttemptContext context)
                 throws IOException {
          -    Path out = FileOutputFormat.getOutputPath(job);
          +    Configuration conf = context.getConfiguration();
          +    String name = context.getJobName();//getTaskAttemptID().toString();
          +    Path dir = FileOutputFormat.getOutputPath(context);
          +    FileSystem fs = dir.getFileSystem(context.getConfiguration());
          +    Path out = FileOutputFormat.getOutputPath(context);

          Review comment:
          This will change the output folder structure and will probably cause collisions of output folders if run in distributed mode (on a Hadoop cluster).
          The directory tree of a segment should look as before:
          ```
          crawl/segments/20170816093452/
          |-- content
          |   `-- part-00000
          |       |-- data
          |       `-- index
          |-- crawl_fetch
          |   `-- part-00000
          |       |-- data
          |       `-- index
          |-- crawl_generate
          |   `-- part-00000
          |-- crawl_parse
          |   `-- part-00000
          |-- parse_data
          |   `-- part-00000
          |       |-- data
          |       `-- index
          `-- parse_text
              `-- part-00000
                  |-- data
                  `-- index
          ```
          There will be changes due to the MapReduce upgrade (part-xxxxx -> part-r-xxxxx). The tree is now
          ```
          crawl/segments/20170911103223/
          |-- content
          |   `-- FetchData
          |       |-- data
          |       `-- index
          |-- crawl_fetch
          |   `-- FetchData
          |       |-- data
          |       `-- index
          |-- crawl_generate
          |   `-- part-r-00000
          |-- crawl_parse
          |   `-- parse\ crawl
          |       `-- segments
          |           `-- 20170911103223
          |-- parse_data
          |   `-- parse\ crawl
          |       `-- segments
          |           `-- 20170911103223
          |               |-- data
          |               `-- index
          `-- parse_text
              `-- parse\ crawl
                  `-- segments
                      `-- 20170911103223
                          |-- data
                          `-- index
          ```
          which makes a crawl fail, e.g. with
          ```
          CrawlDb update: java.io.FileNotFoundException: File file:.../crawl/segments/20170911103223/crawl_parse/parse crawl/data does not exist
                  at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
                  at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
                  at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
                  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
          ```
          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org
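          The folder collision described in the comment above comes from deriving the output file name from the job name. One way to keep a per-task part-file name in the new API is `FileOutputFormat.getUniqueFile` from `org.apache.hadoop.mapreduce.lib.output`. The following is only an illustrative sketch of that idea, not necessarily the fix that was eventually merged:

          ```java
          @Override
          public RecordWriter<Text, NutchWritable> getRecordWriter(TaskAttemptContext context)
              throws IOException {
            Configuration conf = context.getConfiguration();
            // Use a task-derived name such as "part-r-00000" instead of the job name,
            // so each segment subdirectory keeps its own part files and concurrent
            // tasks in distributed mode cannot write into each other's folders.
            String name = FileOutputFormat.getUniqueFile(context, "part", "");
            Path out = FileOutputFormat.getOutputPath(context);
            Path fetch = new Path(new Path(out, CrawlDatum.FETCH_DIR_NAME), name);
            Path content = new Path(new Path(out, Content.DIR_NAME), name);
            // ... open the MapFile/SequenceFile writers under these paths as before.
          }
          ```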
          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138007647

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDbReader.java
          ##########
          @@ -368,41 +367,46 @@ public void close()
             { closeReaders(); }

          -  private TreeMap<String, LongWritable> processStatJobHelper(String crawlDb, Configuration config, boolean sort) throws IOException{
          +  private TreeMap<String, LongWritable> processStatJobHelper(String crawlDb, Configuration config, boolean sort)
          +      throws IOException, InterruptedException, ClassNotFoundException{
               Path tmpFolder = new Path(crawlDb, "stat_tmp" + System.currentTimeMillis());
          -    JobConf job = new NutchJob(config);
          +    Job job = NutchJob.getInstance(config);
          +    config = job.getConfiguration();
               job.setJobName("stats " + crawlDb);
          -    job.setBoolean("db.reader.stats.sort", sort);
          +    config.setBoolean("db.reader.stats.sort", sort);

               FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));

          -    job.setInputFormat(SequenceFileInputFormat.class);
          +    job.setInputFormatClass(SequenceFileInputFormat.class);

               job.setMapperClass(CrawlDbStatMapper.class);
               job.setCombinerClass(CrawlDbStatCombiner.class);
               job.setReducerClass(CrawlDbStatReducer.class);

               FileOutputFormat.setOutputPath(job, tmpFolder);

          -    job.setOutputFormat(SequenceFileOutputFormat.class);
          +    job.setOutputFormatClass(SequenceFileOutputFormat.class);

          Review comment:
          If the job writes its output to a sequence file ...

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

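          The diff reviewed above follows the pattern repeated across the code base: `JobConf`/`JobClient.runJob` from `org.apache.hadoop.mapred` is replaced by `Job`/`waitForCompletion` from `org.apache.hadoop.mapreduce`. A side-by-side sketch of the two idioms, using the stock `Job.getInstance` instead of Nutch's `NutchJob` wrapper and the variable names from the diff, purely for illustration:

          ```java
          // Old API (org.apache.hadoop.mapred):
          //   JobConf job = new NutchJob(config);
          //   job.setBoolean("db.reader.stats.sort", sort);  // flags set on the JobConf itself
          //   job.setInputFormat(SequenceFileInputFormat.class);
          //   JobClient.runJob(job);                         // blocking submit

          // New API (org.apache.hadoop.mapreduce):
          Job job = Job.getInstance(config, "stats " + crawlDb);
          // Job no longer doubles as a Configuration; flags go on its Configuration.
          job.getConfiguration().setBoolean("db.reader.stats.sort", sort);
          job.setInputFormatClass(SequenceFileInputFormat.class);
          job.setOutputFormatClass(SequenceFileOutputFormat.class);
          FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
          FileOutputFormat.setOutputPath(job, tmpFolder);
          // waitForCompletion replaces JobClient.runJob and additionally throws
          // InterruptedException and ClassNotFoundException -- hence the widened
          // throws clauses seen in the diff.
          boolean success = job.waitForCompletion(true);
          ```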
          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138013644

          ##########
          File path: src/java/org/apache/nutch/hostdb/ReadHostDb.java
          ##########
          @@ -205,7 +205,7 @@ private void readHostDb(Path hostDb, Path output, boolean dumpHomepages, boolean

          private void getHostDbRecord(Path hostDb, String host) throws Exception {
          Configuration conf = getConf();

          -    SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(conf, hostDb);
          +    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(hostDb, conf);

          Review comment:
          If HostDb isn't changed to be stored as a MapFile, this must still use a SequenceFile reader; see comments in CrawlDbReader.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138014758

          ##########
          File path: src/java/org/apache/nutch/segment/SegmentReader.java
          ##########
          @@ -502,13 +511,14 @@ public void getStats(Path segment, final SegmentReaderStats stats)
          throws Exception {
          long cnt = 0L;
          Text key = new Text();
          + Text val = new Text();
          FileSystem fs = segment.getFileSystem(getConf());

          if (ge) {

          -      SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(
          -          getConf(), new Path(segment, CrawlDatum.GENERATE_DIR_NAME));
          +      MapFile.Reader[] readers = MapFileOutputFormat.getReaders(
          +          new Path(segment, CrawlDatum.GENERATE_DIR_NAME), getConf());

          Review comment:
          `crawl_generate` is stored as a SequenceFile (and must be, because it's not ordered by key), so it must still be read with a SequenceFile reader.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

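          Because `crawl_generate` stays a plain SequenceFile, it has no `data`/`index` pair and `MapFile.Reader` cannot open it; its part files have to be iterated with `SequenceFile.Reader` directly. A hedged sketch of counting its entries this way (the method and variable names are illustrative, not Nutch code):

          ```java
          long countGenerated(Path segment, Configuration conf) throws IOException {
            Path dir = new Path(segment, CrawlDatum.GENERATE_DIR_NAME);
            FileSystem fs = dir.getFileSystem(conf);
            long cnt = 0;
            for (FileStatus st : fs.listStatus(dir)) {   // one part-r-NNNNN per reducer
              if (!st.getPath().getName().startsWith("part-")) continue;
              try (SequenceFile.Reader r =
                  new SequenceFile.Reader(conf, SequenceFile.Reader.file(st.getPath()))) {
                Text key = new Text();
                CrawlDatum val = new CrawlDatum();
                while (r.next(key, val)) cnt++;
              }
            }
            return cnt;
          }
          ```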
          Hide
          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138073377

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDbReader.java
          ##########
          @@ -368,41 +367,46 @@ public void close()
             { closeReaders(); }

          -  private TreeMap<String, LongWritable> processStatJobHelper(String crawlDb, Configuration config, boolean sort) throws IOException{
          +  private TreeMap<String, LongWritable> processStatJobHelper(String crawlDb, Configuration config, boolean sort)
          +      throws IOException, InterruptedException, ClassNotFoundException{
               Path tmpFolder = new Path(crawlDb, "stat_tmp" + System.currentTimeMillis());
          -    JobConf job = new NutchJob(config);
          +    Job job = NutchJob.getInstance(config);
          +    config = job.getConfiguration();
               job.setJobName("stats " + crawlDb);
          -    job.setBoolean("db.reader.stats.sort", sort);
          +    config.setBoolean("db.reader.stats.sort", sort);

               FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));

          -    job.setInputFormat(SequenceFileInputFormat.class);
          +    job.setInputFormatClass(SequenceFileInputFormat.class);

               job.setMapperClass(CrawlDbStatMapper.class);
               job.setCombinerClass(CrawlDbStatCombiner.class);
               job.setReducerClass(CrawlDbStatReducer.class);

               FileOutputFormat.setOutputPath(job, tmpFolder);

          -    job.setOutputFormat(SequenceFileOutputFormat.class);
          +    job.setOutputFormatClass(SequenceFileOutputFormat.class);
               job.setOutputKeyClass(Text.class);
               job.setOutputValueClass(LongWritable.class);

               // https://issues.apache.org/jira/browse/NUTCH-1029
          -    job.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
          -
          -    JobClient.runJob(job);
          +    config.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
          +    try {
          +      int complete = job.waitForCompletion(true)?0:1;
          +    } catch (InterruptedException | ClassNotFoundException e) {
          +      LOG.error(StringUtils.stringifyException(e));
          +      throw e;
          +    }

               // reading the result
               FileSystem fileSystem = tmpFolder.getFileSystem(config);
          -    SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(config,
          -        tmpFolder);
          +    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(tmpFolder, config);

          Review comment:
          @sebastian-nagel SequenceFileOutputFormat in the upgraded (new) API does not have this sub-routine. One thing I can do is replicate SequenceFileOutputFormat.getReaders (from the old API) in a separate util file in org/apache/nutch/util/; please let me know your thoughts on it.

          The implementation of the old API getReaders can be found [here](https://hadoop.apache.org/docs/r2.6.1/api/src-html/org/apache/hadoop/mapred/SequenceFileOutputFormat.html#line.84). Thanks.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138104817

          ##########
          File path: src/java/org/apache/nutch/crawl/CrawlDbReader.java
          ##########
          @@ -368,41 +367,46 @@ public void close()
             { closeReaders(); }

          -  private TreeMap<String, LongWritable> processStatJobHelper(String crawlDb, Configuration config, boolean sort) throws IOException{
          +  private TreeMap<String, LongWritable> processStatJobHelper(String crawlDb, Configuration config, boolean sort)
          +      throws IOException, InterruptedException, ClassNotFoundException{
               Path tmpFolder = new Path(crawlDb, "stat_tmp" + System.currentTimeMillis());
          -    JobConf job = new NutchJob(config);
          +    Job job = NutchJob.getInstance(config);
          +    config = job.getConfiguration();
               job.setJobName("stats " + crawlDb);
          -    job.setBoolean("db.reader.stats.sort", sort);
          +    config.setBoolean("db.reader.stats.sort", sort);

               FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));

          -    job.setInputFormat(SequenceFileInputFormat.class);
          +    job.setInputFormatClass(SequenceFileInputFormat.class);

               job.setMapperClass(CrawlDbStatMapper.class);
               job.setCombinerClass(CrawlDbStatCombiner.class);
               job.setReducerClass(CrawlDbStatReducer.class);

               FileOutputFormat.setOutputPath(job, tmpFolder);

          -    job.setOutputFormat(SequenceFileOutputFormat.class);
          +    job.setOutputFormatClass(SequenceFileOutputFormat.class);
               job.setOutputKeyClass(Text.class);
               job.setOutputValueClass(LongWritable.class);

               // https://issues.apache.org/jira/browse/NUTCH-1029
          -    job.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
          -
          -    JobClient.runJob(job);
          +    config.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
          +    try {
          +      int complete = job.waitForCompletion(true)?0:1;
          +    } catch (InterruptedException | ClassNotFoundException e) {
          +      LOG.error(StringUtils.stringifyException(e));
          +      throw e;
          +    }

               // reading the result
               FileSystem fileSystem = tmpFolder.getFileSystem(config);
          -    SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(config,
          -        tmpFolder);
          +    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(tmpFolder, config);

          Review comment:
          Yes, reimplementing the getReaders method seems the only way, at least, I didn't find another one.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

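          Replicating the old `mapred` `SequenceFileOutputFormat.getReaders` in a Nutch util class, as discussed above, could look roughly like the sketch below. The class name `SegmentReaderUtil` and the exact filtering follow the linked old-API implementation only loosely; this is an assumption-laden sketch, not the committed code:

          ```java
          public class SegmentReaderUtil {
            /**
             * Open all part files in dir as SequenceFile readers, sorted by path,
             * mirroring org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders,
             * which has no counterpart in the mapreduce (new) API.
             */
            public static SequenceFile.Reader[] getReaders(Path dir, Configuration conf)
                throws IOException {
              FileSystem fs = dir.getFileSystem(conf);
              // Skip hidden files such as _SUCCESS and .crc checksums.
              FileStatus[] parts = fs.listStatus(dir,
                  p -> !p.getName().startsWith("_") && !p.getName().startsWith("."));
              Arrays.sort(parts, (a, b) -> a.getPath().compareTo(b.getPath()));
              List<SequenceFile.Reader> readers = new ArrayList<>();
              for (FileStatus part : parts) {
                readers.add(new SequenceFile.Reader(conf,
                    SequenceFile.Reader.file(part.getPath())));
              }
              return readers.toArray(new SequenceFile.Reader[0]);
            }
          }
          ```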
          Hide
          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-328608371

          @sebastian-nagel please take a look at my latest commit. I have tested it locally and the build is running successfully. The directory hierarchy seems to be fixed with this commit. Thanks.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-328622744

          Thanks, @Omkar20895. Could you also add the class file SegmentReaderUtils.java, it's not included in the last commit.


          Hide
          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-328624935

          Done! @sebastian-nagel

          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-328643458

          Excellent! Continued testing:

          • unit tests pass
          • successfully run a small test crawl in local mode (only inject + few generate-fetch-parse-update cycles)
          • `nutch readdb` works now

          I'll continue testing over the next days (todo: indexer, linkdb, ...) and hope to test it on a Hadoop cluster as well.

          Hide
          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-328653338

          Very good @sebastian-nagel, thank you for the updates.

          Hide
          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-328747611

          @sebastian-nagel Thank you very much. Do keep this thread posted with any results or errors that you face.

          Hide
          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-329512206

          @sebastian-nagel thank you very much for the review.

          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138927671

          ##########
          File path: src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
          ##########
          @@ -214,94 +212,98 @@ public void write(DataOutput out) throws IOException {

             /**
              * Inverts outlinks from the WebGraph to inlinks and attaches node
              * information.
              */
          -  public static class Inverter implements
          -      Mapper<Text, Writable, Text, ObjectWritable>,
          -      Reducer<Text, ObjectWritable, Text, LinkNode> {
          +  public static class Inverter {

          -    private JobConf conf;
          -
          -    public void configure(JobConf conf) {
          -      this.conf = conf;
          -    }
          +    private static Configuration conf;

          Review comment:
          See comments in Generator etc. regarding static variables shared between mapper and reducer classes.
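The recurring review point above can be sketched in plain Java: in the new org.apache.hadoop.mapreduce API each task reads its settings in its own setup() method and keeps them in an instance field, so nothing needs to be shared statically between mapper and reducer classes. The class name, property key, and default below are illustrative, not the actual Nutch code, and a java.util.Map stands in for Hadoop's Configuration so the sketch stays self-contained.

```java
import java.util.Map;

/**
 * Illustrative sketch (not the actual Nutch code): a reducer reads its
 * settings once per task in a setup()-style method and stores them in a
 * per-instance field instead of a static one shared with the mapper.
 * A Map stands in for Hadoop's Configuration here.
 */
class MergeReducerSketch {
  // per-task instance field, initialized from the job configuration
  private int maxInlinks;

  /** Mirrors Reducer.setup(Context): read configuration once per task. */
  void setup(Map<String, String> conf) {
    maxInlinks = Integer.parseInt(conf.getOrDefault("db.max.inlinks", "10000"));
  }

  int getMaxInlinks() {
    return maxInlinks;
  }
}
```

Because each task instance carries its own copy, two tasks running in the same JVM (as happens in local mode) cannot clobber each other's state the way shared static fields can.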

          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138929528

          ##########
          File path: src/java/org/apache/nutch/segment/SegmentMerger.java
          ##########
          @@ -120,20 +121,18 @@

            * @author Andrzej Bialecki
            */
          -public class SegmentMerger extends Configured implements Tool,
          -    Mapper<Text, MetaWrapper, Text, MetaWrapper>,
          -    Reducer<Text, MetaWrapper, Text, MetaWrapper> {
          +public class SegmentMerger extends Configured implements Tool {
             private static final Logger LOG = LoggerFactory
                 .getLogger(MethodHandles.lookup().lookupClass());

             private static final String SEGMENT_PART_KEY = "part";
             private static final String SEGMENT_SLICE_KEY = "slice";

          -  private URLFilters filters = null;
          -  private URLNormalizers normalizers = null;
          -  private SegmentMergeFilters mergeFilters = null;
          -  private long sliceSize = -1;
          -  private long curCount = 0;
          +  private static URLFilters filters = null;
          +  private static URLNormalizers normalizers = null;
          +  private static SegmentMergeFilters mergeFilters = null;
          +  private static long sliceSize = -1;
          +  private static long curCount = 0;

          Review comment:
          See comments in Generator etc. regarding static variables shared between mapper and reducer classes.

          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138926576

          ##########
          File path: src/java/org/apache/nutch/parse/ParseOutputFormat.java
          ##########
          @@ -48,14 +61,19 @@
           import org.apache.hadoop.util.Progressable;

           /* Parse content in a segment. */
          -public class ParseOutputFormat implements OutputFormat<Text, Parse> {
          +public class ParseOutputFormat extends OutputFormat<Text, Parse> {
             private static final Logger LOG = LoggerFactory
                 .getLogger(MethodHandles.lookup().lookupClass());
             private URLFilters filters;
             private URLExemptionFilters exemptionFilters;
             private URLNormalizers normalizers;
             private ScoringFilters scfilters;
          -
          +  private static final NumberFormat NUMBER_FORMAT = NumberFormat.getInstance();

          Review comment:
          NumberFormat isn't thread-safe. Should definitely not be static!
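A hedged sketch of one common fix for the issue flagged above: since java.text.NumberFormat is not thread-safe, each thread can get its own instance via ThreadLocal instead of sharing one static formatter. The class name and the part-NNNNN naming below are illustrative, modeled on Hadoop's partition file names; this is not the actual patch.

```java
import java.text.NumberFormat;

public class PartitionNaming {

  // One formatter per thread: NumberFormat instances are not thread-safe,
  // so a shared static instance could be corrupted under concurrent use.
  private static final ThreadLocal<NumberFormat> NUMBER_FORMAT =
      ThreadLocal.withInitial(() -> {
        NumberFormat fmt = NumberFormat.getInstance();
        fmt.setMinimumIntegerDigits(5);
        fmt.setGroupingUsed(false);
        return fmt;
      });

  /** Formats a partition id in the style of Hadoop's "part-00000" names. */
  public static String partName(int partition) {
    return "part-" + NUMBER_FORMAT.get().format(partition);
  }

  public static void main(String[] args) {
    System.out.println(partName(7)); // prints part-00007
  }
}
```

An alternative with the same effect is making the formatter a plain instance field of the record writer, so each writer (and thus each task thread) owns its own copy.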

          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138925338

          ##########
          File path: src/java/org/apache/nutch/indexer/IndexerMapReduce.java
          ##########
          @@ -70,51 +68,26 @@
             public static final String URL_NORMALIZING = "indexer.url.normalizers";
             public static final String INDEXER_BINARY_AS_BASE64 = "indexer.binary.base64";

          -  private boolean skip = false;
          -  private boolean delete = false;
          -  private boolean deleteRobotsNoIndex = false;
          -  private boolean deleteSkippedByIndexingFilter = false;
          -  private boolean base64 = false;
          -  private IndexingFilters filters;
          -  private ScoringFilters scfilters;
          +  private static boolean skip = false;
          +  private static boolean delete = false;
          +  private static boolean deleteRobotsNoIndex = false;
          +  private static boolean deleteSkippedByIndexingFilter = false;
          +  private static boolean base64 = false;
          +  private static IndexingFilters filters;
          +  private static ScoringFilters scfilters;

             // using normalizers and/or filters

          -  private boolean normalize = false;
          -  private boolean filter = false;
          +  private static boolean normalize = false;
          +  private static boolean filter = false;

             // url normalizers, filters and job configuration

          -  private URLNormalizers urlNormalizers;
          -  private URLFilters urlFilters;
          +  private static URLNormalizers urlNormalizers;
          +  private static URLFilters urlFilters;

          Review comment:
          Regarding static variables, see the comments in Generator. AFAICS they really are shared between IndexerMapper and IndexerReducer. In that case the initialization code is also best shared, to avoid duplication (cf. the setup methods of mapper and reducer).
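The shared-initialization idea suggested here can be sketched as follows: both the mapper's and the reducer's setup() call one helper that parses the job settings, so the parsing logic exists exactly once. All names and configuration keys below are illustrative (a java.util.Map stands in for Hadoop's Configuration); this is a sketch of the pattern, not the actual IndexerMapReduce code.

```java
import java.util.Map;

/**
 * Illustrative sketch: a settings holder parsed by one shared helper,
 * so a mapper's and a reducer's setup() methods don't duplicate the
 * same configuration-reading code. Keys and names are hypothetical;
 * a Map stands in for Hadoop's Configuration.
 */
class IndexerSettings {
  final boolean skip;
  final boolean delete;
  final boolean normalize;

  private IndexerSettings(boolean skip, boolean delete, boolean normalize) {
    this.skip = skip;
    this.delete = delete;
    this.normalize = normalize;
  }

  /** Called from both the mapper's and the reducer's setup() method. */
  static IndexerSettings fromConf(Map<String, String> conf) {
    return new IndexerSettings(
        Boolean.parseBoolean(conf.getOrDefault("indexer.skip.notmodified", "false")),
        Boolean.parseBoolean(conf.getOrDefault("indexer.delete", "false")),
        Boolean.parseBoolean(conf.getOrDefault("indexer.url.normalizers", "false")));
  }
}
```

Each task then keeps the returned object in an instance field rather than in static state, which also sidesteps the shared-static concern raised for the other classes.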

          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138923714

          ##########
          File path: src/java/org/apache/nutch/crawl/LinkDbMerger.java
          ##########
          @@ -68,12 +66,11 @@

            * @author Andrzej Bialecki
            */
          -public class LinkDbMerger extends Configured implements Tool,
          -    Reducer<Text, Inlinks, Text, Inlinks> {
          +public class LinkDbMerger extends Configured implements Tool {
             private static final Logger LOG = LoggerFactory
                 .getLogger(MethodHandles.lookup().lookupClass());
          -  private int maxInlinks;
          +  private static int maxInlinks;

          Review comment:
          Similar to the situation in Generator: maybe it's clearer and safer to move the static variable to LinkDbMergeReducer as instance variable.

          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138928596

          ##########
          File path: src/java/org/apache/nutch/scoring/webgraph/LinkRank.java
          ##########
          @@ -284,64 +293,72 @@ private void runAnalysis(Path nodeDb, Path inverted, Path output,

             /**
              * This is used to determine a rank one score for pages with zero inlinks but
              * that contain outlinks.
              */
          -  private static class Counter implements
          -      Mapper<Text, Node, Text, LongWritable>,
          -      Reducer<Text, LongWritable, Text, LongWritable> {
          +  private static class Counter {

               private static Text numNodes = new Text(NUM_NODES);
               private static LongWritable one = new LongWritable(1L);

          -    public void configure(JobConf conf) {
          -    }
          -
               /**
                * Outputs one for every node.
                */
          -    public void map(Text key, Node value,
          -        OutputCollector<Text, LongWritable> output, Reporter reporter)
          -        throws IOException {
          -      output.collect(numNodes, one);
          +    public static class CountMapper extends
          +        Mapper<Text, Node, Text, LongWritable> {
          +      public void setup(Mapper<Text, Node, Text, LongWritable>.Context context) {
          +      }
          +
          +      public void map(Text key, Node value, Context context)
          +          throws IOException, InterruptedException {
          +        context.write(numNodes, one);
          +      }
               }

               /**
                * Totals the node number and outputs a single total value.
                */
          -    public void reduce(Text key, Iterator<LongWritable> values,
          -        OutputCollector<Text, LongWritable> output, Reporter reporter)
          -        throws IOException {
          -      long total = 0;
          -      while (values.hasNext()) {
          -        total += values.next().get();
          -      }
          -      output.collect(numNodes, new LongWritable(total));
          -    }
          +    public static class CountReducer extends
          +        Reducer<Text, LongWritable, Text, LongWritable> {
          +      public void setup(Reducer<Text, LongWritable, Text, LongWritable>.Context context) {
          +      }
          +
          +      public void reduce(Text key, Iterable<LongWritable> values, Context context)
          +          throws IOException, InterruptedException {
          +        long total = 0;
          +        for (LongWritable val : values) {
          +          total += val.get();
          +        }
          +        context.write(numNodes, new LongWritable(total));
          +      }
          +    }

               public void close() {
               }
             }

          -  private static class Initializer implements Mapper<Text, Node, Text, Node> {
          +  private static class Initializer extends Mapper<Text, Node, Text, Node> {

          -    private JobConf conf;
          -    private float initialScore = 1.0f;
          +    private static Configuration conf;
          +    private static float initialScore = 1.0f;

          Review comment:
          Why static?

          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138930231

          ##########
          File path: src/java/org/apache/nutch/segment/SegmentReader.java
          ##########
          @@ -70,32 +69,31 @@
          import org.apache.nutch.util.HadoopFSUtil;
          import org.apache.nutch.util.NutchConfiguration;
          import org.apache.nutch.util.NutchJob;
          +import org.apache.nutch.util.SegmentReaderUtil;

           /** Dump the content of a segment. */
          -public class SegmentReader extends Configured implements Tool,
          -    Reducer<Text, NutchWritable, Text, Text> {
          +public class SegmentReader extends Configured implements Tool {

             private static final Logger LOG = LoggerFactory
                 .getLogger(MethodHandles.lookup().lookupClass());

          -  long recNo = 0L;
          +  static long recNo = 0L;

          Review comment:
          See comments in Generator etc. regarding static variables shared between mapper and reducer classes.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138930635

          ##########
          File path: src/java/org/apache/nutch/segment/SegmentReader.java
          ##########
          @@ -70,32 +69,31 @@
          import org.apache.nutch.util.HadoopFSUtil;
          import org.apache.nutch.util.NutchConfiguration;
          import org.apache.nutch.util.NutchJob;
          +import org.apache.nutch.util.SegmentReaderUtil;

           /** Dump the content of a segment. */
          -public class SegmentReader extends Configured implements Tool,
          -    Reducer<Text, NutchWritable, Text, Text> {
          +public class SegmentReader extends Configured implements Tool {

             private static final Logger LOG = LoggerFactory
                 .getLogger(MethodHandles.lookup().lookupClass());

          -  long recNo = 0L;
          +  static long recNo = 0L;

             private boolean co, fe, ge, pa, pd, pt;

          -  public static class InputCompatMapper extends MapReduceBase implements
          +  public static class InputCompatMapper extends
                 Mapper<WritableComparable<?>, Writable, Text, NutchWritable> {
          -    private Text newKey = new Text();
          +    private static Text newKey = new Text();

          Review comment:
          This will cause trouble when the mapper class is used by multiple threads (see MultithreadedMapper).

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org
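          The thread-safety concern flagged above can be sketched without Hadoop: a static mutable key is a single object shared by every mapper instance (and every buffered reference to it), so a MultithreadedMapper peer, or any consumer that holds the reference, observes later overwrites. The class names below (`SharedKeyPitfall`, `MutableText`) are hypothetical stand-ins for illustration, not Nutch or Hadoop code.

          ```java
          import java.util.ArrayList;
          import java.util.Arrays;
          import java.util.List;

          // Hypothetical sketch: a static mutable key, like the proposed
          // "static Text newKey", is one object shared by all users.
          public class SharedKeyPitfall {

            // Stand-in for org.apache.hadoop.io.Text: a reusable mutable holder.
            static class MutableText {
              private String value = "";
              void set(String v) { value = v; }
              String get() { return value; }
            }

            static final MutableText newKey = new MutableText(); // the problematic static

            // Emits one record per URL, reusing the static key, and returns what a
            // reference-holding consumer would finally read back.
            public static List<String> collect() {
              List<MutableText> buffered = new ArrayList<>();
              for (String url : Arrays.asList("http://a/", "http://b/", "http://c/")) {
                newKey.set(url);
                buffered.add(newKey);     // the same object is appended three times
              }
              List<String> seen = new ArrayList<>();
              for (MutableText t : buffered) {
                seen.add(t.get());        // every entry shows the *last* value written
              }
              return seen;
            }

            public static void main(String[] args) {
              System.out.println(collect()); // prints [http://c/, http://c/, http://c/]
            }
          }
          ```

          An instance field gives each mapper object its own key, which is why the pre-upgrade non-static `private Text newKey` was safe under MultithreadedMapper.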

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138920971

          ##########
          File path: src/java/org/apache/nutch/crawl/LinkDb.java
          ##########
          @@ -56,11 +61,11 @@
          public static final String CURRENT_NAME = "current";
          public static final String LOCK_NAME = ".locked";

          -  private int maxAnchorLength;
          -  private boolean ignoreInternalLinks;
          -  private boolean ignoreExternalLinks;
          -  private URLFilters urlFilters;
          -  private URLNormalizers urlNormalizers;
          +  private static int maxAnchorLength;
          +  private static boolean ignoreInternalLinks;
          +  private static boolean ignoreExternalLinks;
          +  private static URLFilters urlFilters;
          +  private static URLNormalizers urlNormalizers;

          Review comment:
          Similar to the situation in Generator: maybe it's better to make them instance variables in LinkDbMapper.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org
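          The pattern suggested above (instance variables in the mapper, initialized from the job configuration in setup()) can be sketched in a Hadoop-free way. Everything below (`SetupPatternSketch`, the mini `Configuration`, the `linkdb.max.anchor.length` key) is a hypothetical illustration, not the actual Nutch classes.

          ```java
          import java.util.HashMap;
          import java.util.Map;

          // Hypothetical sketch of the safe pattern: each task instance reads its
          // settings into *instance* fields in setup(), instead of sharing mutable
          // statics across mapper and reducer classes.
          public class SetupPatternSketch {

            static class Configuration {
              private final Map<String, String> kv = new HashMap<>();
              void set(String k, String v) { kv.put(k, v); }
              int getInt(String k, int dflt) {
                return kv.containsKey(k) ? Integer.parseInt(kv.get(k)) : dflt;
              }
            }

            static class LinkDbMapper {
              private int maxAnchorLength;     // instance field, set per task attempt

              void setup(Configuration conf) { // analogue of Mapper.setup(Context)
                maxAnchorLength = conf.getInt("linkdb.max.anchor.length", 100);
              }

              String map(String anchor) {      // truncate over-long anchors
                return anchor.length() > maxAnchorLength
                    ? anchor.substring(0, maxAnchorLength) : anchor;
              }
            }

            public static void main(String[] args) {
              Configuration conf = new Configuration();
              conf.set("linkdb.max.anchor.length", "5");

              // Each task attempt, possibly in a different JVM, configures itself:
              LinkDbMapper m = new LinkDbMapper();
              m.setup(conf);
              System.out.println(m.map("a very long anchor")); // prints a ver
            }
          }
          ```

          Because separate task attempts run in separate JVMs, statics are never actually shared between a mapper and a reducer anyway; instance fields filled in setup() make that explicit.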

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138930916

          ##########
          File path: src/java/org/apache/nutch/tools/FreeGenerator.java
          ##########
          @@ -66,79 +64,80 @@
          private static final String FILTER_KEY = "free.generator.filter";
          private static final String NORMALIZE_KEY = "free.generator.normalize";

          -  public static class FG extends MapReduceBase implements
          -      Mapper<WritableComparable<?>, Text, Text, Generator.SelectorEntry>,
          -      Reducer<Text, Generator.SelectorEntry, Text, CrawlDatum> {
          -    private URLNormalizers normalizers = null;
          -    private URLFilters filters = null;
          -    private ScoringFilters scfilters;
          -    private CrawlDatum datum = new CrawlDatum();
          -    private Text url = new Text();
          -    private int defaultInterval = 0;
          +  public static class FG {
          +    private static URLNormalizers normalizers = null;
          +    private static URLFilters filters = null;
          +    private static ScoringFilters scfilters;
          +    private static CrawlDatum datum = new CrawlDatum();
          +    private static Text url = new Text();
          +    private static int defaultInterval = 0;

          Review comment:
          See comments in Generator etc. regarding static variables shared between mapper and reducer classes.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138929207

          ##########
          File path: src/java/org/apache/nutch/scoring/webgraph/NodeDumper.java
          ##########
          @@ -159,31 +181,17 @@ public void reduce(FloatWritable key, Iterator<Text> values,

           * of either the number of inlinks, the number of outlinks or the score. The
           * computed value is then either the sum of all parts or the top value.
           */
          -  public static class Dumper extends Configured implements
          -      Mapper<Text, Node, Text, FloatWritable>,
          -      Reducer<Text, FloatWritable, Text, FloatWritable> {
          -
          -    private JobConf conf;
          -    private boolean inlinks = false;
          -    private boolean outlinks = false;
          -    private boolean scores = false;
          -    private long topn = Long.MAX_VALUE;
          -    private boolean host = false;
          -    private boolean domain = false;
          -    private boolean sum = false;
          -    private boolean max = false;
          -
          -    public void configure(JobConf conf) {
          -      this.conf = conf;
          -      this.inlinks = conf.getBoolean("inlinks", false);
          -      this.outlinks = conf.getBoolean("outlinks", false);
          -      this.scores = conf.getBoolean("scores", true);
          -      this.topn = conf.getLong("topn", Long.MAX_VALUE);
          -      this.host = conf.getBoolean("host", false);
          -      this.domain = conf.getBoolean("domain", false);
          -      this.sum = conf.getBoolean("sum", false);
          -      this.max = conf.getBoolean("max", false);
          -    }
          +  public static class Dumper extends Configured {
          +
          +    private static Configuration conf;
          +    private static boolean inlinks = false;
          +    private static boolean outlinks = false;
          +    private static boolean scores = false;
          +    private static long topn = Long.MAX_VALUE;
          +    private static boolean host = false;
          +    private static boolean domain = false;
          +    private static boolean sum = false;
          +    private static boolean max = false;

          Review comment:
          See comments in Generator etc. regarding static variables shared between mapper and reducer classes.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138931097

          ##########
          File path: src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
          ##########
          @@ -73,18 +71,17 @@

           * </p>
           */
          -public class ArcSegmentCreator extends Configured implements Tool,
          -    Mapper<Text, BytesWritable, Text, NutchWritable> {
          +public class ArcSegmentCreator extends Configured implements Tool {

             private static final Logger LOG = LoggerFactory
                 .getLogger(MethodHandles.lookup().lookupClass());
             public static final String URL_VERSION = "arc.url.version";

          -  private JobConf jobConf;
          -  private URLFilters urlFilters;
          -  private ScoringFilters scfilters;
          -  private ParseUtil parseUtil;
          -  private URLNormalizers normalizers;
          -  private int interval;
          +  private static Configuration conf;
          +  private static URLFilters urlFilters;
          +  private static ScoringFilters scfilters;
          +  private static ParseUtil parseUtil;
          +  private static URLNormalizers normalizers;
          +  private static int interval;

          Review comment:
          See comments in Generator etc. regarding static variables shared between mapper and reducer classes.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138928833

          ##########
          File path: src/java/org/apache/nutch/scoring/webgraph/LinkRank.java
          ##########
          @@ -350,186 +367,221 @@ public void close() {

           * WebGraph. The link analysis process consists of inverting, analyzing and
           * scoring, in a loop for a given number of iterations.
           */
          -  private static class Inverter implements
          -      Mapper<Text, Writable, Text, ObjectWritable>,
          -      Reducer<Text, ObjectWritable, Text, LinkDatum> {
          -
          -    private JobConf conf;
          -
          -    public void configure(JobConf conf) {
          -      this.conf = conf;
          -    }
          +  private static class Inverter {
          +
          +    private static Configuration conf;

               /**
                * Convert values to ObjectWritable
                */
          -    public void map(Text key, Writable value,
          -        OutputCollector<Text, ObjectWritable> output, Reporter reporter)
          -        throws IOException {
          -      ObjectWritable objWrite = new ObjectWritable();
          -      objWrite.set(value);
          -      output.collect(key, objWrite);
          +    public static class InvertMapper extends
          +        Mapper<Text, Writable, Text, ObjectWritable> {
          +
          +      public void setup(Mapper<Text, Writable, Text, ObjectWritable>.Context context) {
          +        conf = context.getConfiguration();
          +      }
          +
          +      public void cleanup() {
          +      }
          +
          +      public void map(Text key, Writable value, Context context)
          +          throws IOException, InterruptedException {
          +        ObjectWritable objWrite = new ObjectWritable();
          +        objWrite.set(value);
          +        context.write(key, objWrite);
          +      }
             }

               /**
                * Inverts outlinks to inlinks, attaches current score for the outlink from
                * the NodeDb of the WebGraph.
                */
          -    public void reduce(Text key, Iterator<ObjectWritable> values,
          -        OutputCollector<Text, LinkDatum> output, Reporter reporter)
          -        throws IOException {
          -
          -      String fromUrl = key.toString();
          -      List<LinkDatum> outlinks = new ArrayList<>();
          -      Node node = null;
          -
          -      // aggregate outlinks, assign other values
          -      while (values.hasNext()) {
          -        ObjectWritable write = values.next();
          -        Object obj = write.get();
          -        if (obj instanceof Node) {
          -          node = (Node) obj;
          -        } else if (obj instanceof LinkDatum) {
          -          outlinks.add(WritableUtils.clone((LinkDatum) obj, conf));
          -        }
          -      }
          -
          -      // get the number of outlinks and the current inlink and outlink scores
          -      // from the node of the url
          -      int numOutlinks = node.getNumOutlinks();
          -      float inlinkScore = node.getInlinkScore();
          -      float outlinkScore = node.getOutlinkScore();
          -      LOG.debug(fromUrl + ": num outlinks " + numOutlinks);
          -
          -      // can't invert if no outlinks
          -      if (numOutlinks > 0) {
          -        for (int i = 0; i < outlinks.size(); i++) {
          -          LinkDatum outlink = outlinks.get(i);
          -          String toUrl = outlink.getUrl();
          -
          -          outlink.setUrl(fromUrl);
          -          outlink.setScore(outlinkScore);
          -
          -          // collect the inverted outlink
          -          output.collect(new Text(toUrl), outlink);
          -          LOG.debug(toUrl + ": inverting inlink from " + fromUrl
          -              + " origscore: " + inlinkScore + " numOutlinks: " + numOutlinks
          -              + " inlinkscore: " + outlinkScore);
          -        }
          -      }
          -    }
          -
          -    public void close() {
          -    }
          +    public static class InvertReducer extends
          +        Reducer<Text, ObjectWritable, Text, LinkDatum> {
          +
          +      public void setup(Reducer<Text, ObjectWritable, Text, LinkDatum>.Context context) {
          +        conf = context.getConfiguration();
          +      }
          +
          +      public void cleanup() {
          +      }
          +
          +      public void reduce(Text key, Iterable<ObjectWritable> values,
          +          Context context) throws IOException, InterruptedException {
          +
          +        String fromUrl = key.toString();
          +        List<LinkDatum> outlinks = new ArrayList<>();
          +        Node node = null;
          +
          +        // aggregate outlinks, assign other values
          +        for (ObjectWritable write : values) {
          +          Object obj = write.get();
          +          if (obj instanceof Node) {
          +            node = (Node) obj;
          +          } else if (obj instanceof LinkDatum) {
          +            outlinks.add(WritableUtils.clone((LinkDatum) obj, conf));
          +          }
          +        }
          +
          +        // get the number of outlinks and the current inlink and outlink scores
          +        // from the node of the url
          +        int numOutlinks = node.getNumOutlinks();
          +        float inlinkScore = node.getInlinkScore();
          +        float outlinkScore = node.getOutlinkScore();
          +        LOG.debug(fromUrl + ": num outlinks " + numOutlinks);
          +
          +        // can't invert if no outlinks
          +        if (numOutlinks > 0) {
          +          for (int i = 0; i < outlinks.size(); i++) {
          +            LinkDatum outlink = outlinks.get(i);
          +            String toUrl = outlink.getUrl();
          +
          +            outlink.setUrl(fromUrl);
          +            outlink.setScore(outlinkScore);
          +
          +            // collect the inverted outlink
          +            context.write(new Text(toUrl), outlink);
          +            LOG.debug(toUrl + ": inverting inlink from " + fromUrl
          +                + " origscore: " + inlinkScore + " numOutlinks: " + numOutlinks
          +                + " inlinkscore: " + outlinkScore);
          +          }
          +        }
          +      }
             }
           }

           /**
            * Runs a single link analysis iteration.
            */
          -  private static class Analyzer implements
          -      Mapper<Text, Writable, Text, ObjectWritable>,
          -      Reducer<Text, ObjectWritable, Text, Node> {
          -
          -    private JobConf conf;
          -    private float dampingFactor = 0.85f;
          -    private float rankOne = 0.0f;
          -    private int itNum = 0;
          -    private boolean limitPages = true;
          -    private boolean limitDomains = true;
          +  private static class Analyzer {
          +
          +    private static Configuration conf;
          +    private static float dampingFactor = 0.85f;
          +    private static float rankOne = 0.0f;
          +    private static int itNum = 0;
          +    private static boolean limitPages = true;
          +    private static boolean limitDomains = true;

          Review comment:
          See comments in Generator etc. regarding static variables shared between mapper and reducer classes.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org
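          The mechanical shape of the conversion reviewed in this diff (old mapred reduce with an Iterator plus OutputCollector, new mapreduce reduce with an Iterable writing through a Context) can be sketched without Hadoop. The interfaces below are hypothetical single-method stand-ins for the real Hadoop types, used only to show the signature change.

          ```java
          import java.util.Arrays;
          import java.util.HashMap;
          import java.util.Iterator;
          import java.util.Map;

          // Hypothetical, Hadoop-free sketch of the mapred -> mapreduce reduce change.
          public class ReduceApiSketch {

            interface OutputCollector<K, V> { void collect(K key, V value); } // old style
            interface Context<K, V> { void write(K key, V value); }           // new style

            // Old-style signature: while (values.hasNext()) ... output.collect(...)
            static void oldReduce(String key, Iterator<Long> values,
                OutputCollector<String, Long> output) {
              long total = 0;
              while (values.hasNext()) total += values.next();
              output.collect(key, total);
            }

            // New-style signature: for-each over an Iterable ... context.write(...)
            static void newReduce(String key, Iterable<Long> values,
                Context<String, Long> context) {
              long total = 0;
              for (long v : values) total += v;
              context.write(key, total);
            }

            public static void main(String[] args) {
              Map<String, Long> oldOut = new HashMap<>();
              Map<String, Long> newOut = new HashMap<>();
              Iterable<Long> ones = Arrays.asList(1L, 1L, 1L);

              oldReduce("numNodes", ones.iterator(), oldOut::put);
              newReduce("numNodes", ones, newOut::put);

              System.out.println(oldOut); // prints {numNodes=3}
              System.out.println(newOut); // prints {numNodes=3}
            }
          }
          ```

          The same total is produced either way; what changes is that the new API folds configure/close into setup/cleanup on the Context-bearing class and adds InterruptedException to the throws clause.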

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138927288

          ##########
          File path: src/java/org/apache/nutch/parse/ParseSegment.java
          ##########
          @@ -43,20 +50,16 @@
          import java.util.Map.Entry;

          /* Parse content in a segment. */
-public class ParseSegment extends NutchTool implements Tool,
-    Mapper<WritableComparable<?>, Content, Text, ParseImpl>,
-    Reducer<Text, Writable, Text, Writable> {
+public class ParseSegment extends NutchTool implements Tool {

   private static final Logger LOG = LoggerFactory
       .getLogger(MethodHandles.lookup().lookupClass());

   public static final String SKIP_TRUNCATED = "parser.skip.truncated";

-  private ScoringFilters scfilters;
+  private static ScoringFilters scfilters;
-  private ParseUtil parseUtil;
-
-  private boolean skipTruncated;
+  private static boolean skipTruncated;

          Review comment:
          See comments in Generator etc. regarding static variables shared between mapper and reducer classes.
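The point behind this recurring review comment can be illustrated without Hadoop at all. In the sketch below, `MiniConf` is a hypothetical stand-in for Hadoop's `Configuration` and `ParseMapperSketch` is an illustrative class, not the actual patch; only the `parser.skip.truncated` property name is taken from ParseSegment. Each task object reads its own copy of the setting in its `setup()` method, so nothing needs to be shared through a static field:

```java
import java.util.HashMap;
import java.util.Map;

public class SetupDemo {

  /** Hypothetical stand-in for Hadoop's Configuration (illustration only). */
  static class MiniConf {
    private final Map<String, String> props = new HashMap<>();
    void set(String key, String value) { props.put(key, value); }
    boolean getBoolean(String key, boolean dflt) {
      String v = props.get(key);
      return v == null ? dflt : Boolean.parseBoolean(v);
    }
  }

  /** Sketch of a task class that keeps its settings in instance fields. */
  static class ParseMapperSketch {
    private boolean skipTruncated; // instance field, not static

    // Analogous to Mapper.setup(Context): read the value from the job
    // configuration instead of relying on a shared static variable.
    void setup(MiniConf conf) {
      skipTruncated = conf.getBoolean("parser.skip.truncated", true);
    }

    boolean isSkipTruncated() { return skipTruncated; }
  }

  public static void main(String[] args) {
    MiniConf conf = new MiniConf();
    conf.set("parser.skip.truncated", "false");

    // Two independent consumers (think: mapper and reducer, which in
    // distributed mode run in separate JVMs) each read their own copy
    // from the configuration in setup().
    ParseMapperSketch mapperSide = new ParseMapperSketch();
    ParseMapperSketch reducerSide = new ParseMapperSketch();
    mapperSide.setup(conf);
    reducerSide.setup(conf);

    System.out.println(mapperSide.isSkipTruncated() + " "
        + reducerSide.isSkipTruncated());
  }
}
```

Because both sides obtain the value from the job configuration, the result is the same whether or not they share a process, which is exactly why per-instance fields set in `setup()` are safe and shared statics are not.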

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138919091

          ##########
          File path: src/java/org/apache/nutch/crawl/Generator.java
          ##########
          @@ -114,81 +122,34 @@ public String toString() {
          }

          /** Selects entries due for fetch. */

-  public static class Selector implements
-      Mapper<Text, CrawlDatum, FloatWritable, SelectorEntry>,
-      Partitioner<FloatWritable, Writable>,
-      Reducer<FloatWritable, SelectorEntry, FloatWritable, SelectorEntry> {
-    private LongWritable genTime = new LongWritable(System.currentTimeMillis());
-    private long curTime;
-    private long limit;
-    private long count;
-    private HashMap<String, int[]> hostCounts = new HashMap<>();
-    private int segCounts[];
-    private int maxCount;
-    private boolean byDomain = false;
-    private Partitioner<Text, Writable> partitioner = new URLPartitioner();
-    private URLFilters filters;
-    private URLNormalizers normalizers;
-    private ScoringFilters scfilters;
-    private SelectorEntry entry = new SelectorEntry();
-    private FloatWritable sortValue = new FloatWritable();
-    private boolean filter;
-    private boolean normalise;
-    private long genDelay;
-    private FetchSchedule schedule;
-    private float scoreThreshold = 0f;
-    private int intervalThreshold = -1;
-    private String restrictStatus = null;
-    private int maxNumSegments = 1;
-    private Expression expr = null;
-    private int currentsegmentnum = 1;
-    private SequenceFile.Reader[] hostdbReaders = null;
-    private Expression maxCountExpr = null;
-    private Expression fetchDelayExpr = null;
-    private JobConf conf = null;
-    public void configure(JobConf job) {
-      this.conf = job;
-      curTime = job.getLong(GENERATOR_CUR_TIME, System.currentTimeMillis());
-      limit = job.getLong(GENERATOR_TOP_N, Long.MAX_VALUE)
-          / job.getNumReduceTasks();
-      maxCount = job.getInt(GENERATOR_MAX_COUNT, -1);
-      if (maxCount == -1) {
-        byDomain = false;
-      }
-      if (GENERATOR_COUNT_VALUE_DOMAIN.equals(job.get(GENERATOR_COUNT_MODE)))
-        byDomain = true;
-      filters = new URLFilters(job);
-      normalise = job.getBoolean(GENERATOR_NORMALISE, true);
-      if (normalise)
-        normalizers = new URLNormalizers(job,
-            URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
-      scfilters = new ScoringFilters(job);
-      partitioner.configure(job);
-      filter = job.getBoolean(GENERATOR_FILTER, true);
-      genDelay = job.getLong(GENERATOR_DELAY, 7L) * 3600L * 24L * 1000L;
-      long time = job.getLong(Nutch.GENERATE_TIME_KEY, 0L);
-      if (time > 0)
-        genTime.set(time);
-      schedule = FetchScheduleFactory.getFetchSchedule(job);
-      scoreThreshold = job.getFloat(GENERATOR_MIN_SCORE, Float.NaN);
-      intervalThreshold = job.getInt(GENERATOR_MIN_INTERVAL, -1);
-      restrictStatus = job.get(GENERATOR_RESTRICT_STATUS, null);
-      expr = JexlUtil.parseExpression(job.get(GENERATOR_EXPR, null));
-      maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
-      segCounts = new int[maxNumSegments];
-      if (job.get(GENERATOR_HOSTDB) != null) {
-        try {
-          Path path = new Path(job.get(GENERATOR_HOSTDB), "current");
-          hostdbReaders = SequenceFileOutputFormat.getReaders(job, path);
-          maxCountExpr = JexlUtil.parseExpression(job.get(GENERATOR_MAX_COUNT_EXPR, null));
-          fetchDelayExpr = JexlUtil.parseExpression(job.get(GENERATOR_FETCH_DELAY_EXPR, null));
-        } catch (IOException e) {
-          LOG.error("Error reading HostDB because {}", e.getMessage());
-        }
-      }
-    }
+  public static class Selector extends
+      Partitioner<FloatWritable, Writable> {
+    private static LongWritable genTime = new LongWritable(System.currentTimeMillis());
+    private static long curTime;
+    private static long limit;
+    private static int segCounts[];
+    private static int maxCount;
+    private static boolean byDomain = false;
+    private static URLFilters filters;
+    private static URLNormalizers normalizers;
+    private static ScoringFilters scfilters;
+    private static SelectorEntry entry = new SelectorEntry();
+    private static FloatWritable sortValue = new FloatWritable();
+    private static boolean filter;
+    private static boolean normalise;
+    private static long genDelay;
+    private static FetchSchedule schedule;
+    private static float scoreThreshold = 0f;
+    private static int intervalThreshold = -1;
+    private static String restrictStatus = null;
+    private static int maxNumSegments = 1;
+    private static Expression expr = null;
+    private static MapFile.Reader[] hostdbReaders = null;
+    private static Expression maxCountExpr = null;
+    private static Expression fetchDelayExpr = null;
+    private static Configuration config;
+

          Review comment:
          Sharing configuration values between mapper and reducer via static variables does not work in distributed mode, because the mapper and reducer each run in their own JVM. It's not enough to set a static variable in the mapper's setup method.
          Maybe it's better to get rid of the Selector class and split it into independent classes SelectorMapper, SelectorPartitioner and SelectorReducer. Having no shared static variables would also make it easy to check whether each variable is properly set from the configuration.
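A hedged sketch of the split suggested here, written against the new org.apache.hadoop.mapreduce API. The class names SelectorMapper/SelectorReducer are the reviewer's proposal, not code from the patch; CrawlDatum, SelectorEntry and the GENERATOR_* constants are assumed to come from Nutch's Generator, and both classes would be nested inside it. The key point is that every field is an instance field populated in that task's own setup():

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Inside Generator (sketch, not the actual patch):

public static class SelectorMapper
    extends Mapper<Text, CrawlDatum, FloatWritable, SelectorEntry> {
  private long curTime;   // instance fields, not static
  private boolean filter;

  @Override
  protected void setup(Context context) throws IOException {
    // Each task reads its own copy of the settings from the job
    // configuration; nothing is shared with the reducer's JVM.
    Configuration conf = context.getConfiguration();
    curTime = conf.getLong(GENERATOR_CUR_TIME, System.currentTimeMillis());
    filter = conf.getBoolean(GENERATOR_FILTER, true);
  }
  // map(...) would filter/score entries as before, emitting via context.write(...)
}

public static class SelectorReducer
    extends Reducer<FloatWritable, SelectorEntry, FloatWritable, SelectorEntry> {
  private long limit;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    // The per-reducer limit is derived independently in this JVM:
    limit = conf.getLong(GENERATOR_TOP_N, Long.MAX_VALUE)
        / context.getNumReduceTasks();
  }
  // reduce(...) would apply the limit while emitting selected entries
}
```

Because neither class reaches into the other, a missing configuration value shows up immediately in the task that needs it, instead of silently reading a default-initialized static.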


          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138929031

          ##########
          File path: src/java/org/apache/nutch/scoring/webgraph/NodeDumper.java
          ##########
          @@ -86,70 +84,94 @@

   * on the command line, the top urls could be for number of inlinks, for
   * number of outlinks, or for link analysis score.
   */
-  public static class Sorter extends Configured implements
-      Mapper<Text, Node, FloatWritable, Text>,
-      Reducer<FloatWritable, Text, Text, FloatWritable> {
+  public static class Sorter extends Configured {
-    private JobConf conf;
-    private boolean inlinks = false;
-    private boolean outlinks = false;
-    private boolean scores = false;
-    private long topn = Long.MAX_VALUE;
+    private static Configuration conf;
+    private static boolean inlinks = false;
+    private static boolean outlinks = false;
+    private static boolean scores = false;
+    private static long topn = Long.MAX_VALUE;

          Review comment:
          See comments in Generator etc. regarding static variables shared between mapper and reducer classes.


          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138949121

          ##########
          File path: src/java/org/apache/nutch/crawl/Generator.java
          ##########
          @@ -114,81 +122,34 @@ public String toString() {
          }

          /** Selects entries due for fetch. */

-  public static class Selector implements
-      Mapper<Text, CrawlDatum, FloatWritable, SelectorEntry>,
-      Partitioner<FloatWritable, Writable>,
-      Reducer<FloatWritable, SelectorEntry, FloatWritable, SelectorEntry> {
-    private LongWritable genTime = new LongWritable(System.currentTimeMillis());
-    private long curTime;
-    private long limit;
-    private long count;
-    private HashMap<String, int[]> hostCounts = new HashMap<>();
-    private int segCounts[];
-    private int maxCount;
-    private boolean byDomain = false;
-    private Partitioner<Text, Writable> partitioner = new URLPartitioner();
-    private URLFilters filters;
-    private URLNormalizers normalizers;
-    private ScoringFilters scfilters;
-    private SelectorEntry entry = new SelectorEntry();
-    private FloatWritable sortValue = new FloatWritable();
-    private boolean filter;
-    private boolean normalise;
-    private long genDelay;
-    private FetchSchedule schedule;
-    private float scoreThreshold = 0f;
-    private int intervalThreshold = -1;
-    private String restrictStatus = null;
-    private int maxNumSegments = 1;
-    private Expression expr = null;
-    private int currentsegmentnum = 1;
-    private SequenceFile.Reader[] hostdbReaders = null;
-    private Expression maxCountExpr = null;
-    private Expression fetchDelayExpr = null;
-    private JobConf conf = null;
-    public void configure(JobConf job) {
-      this.conf = job;
-      curTime = job.getLong(GENERATOR_CUR_TIME, System.currentTimeMillis());
-      limit = job.getLong(GENERATOR_TOP_N, Long.MAX_VALUE)
-          / job.getNumReduceTasks();
-      maxCount = job.getInt(GENERATOR_MAX_COUNT, -1);
-      if (maxCount == -1) {
-        byDomain = false;
-      }
-      if (GENERATOR_COUNT_VALUE_DOMAIN.equals(job.get(GENERATOR_COUNT_MODE)))
-        byDomain = true;
-      filters = new URLFilters(job);
-      normalise = job.getBoolean(GENERATOR_NORMALISE, true);
-      if (normalise)
-        normalizers = new URLNormalizers(job,
-            URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
-      scfilters = new ScoringFilters(job);
-      partitioner.configure(job);
-      filter = job.getBoolean(GENERATOR_FILTER, true);
-      genDelay = job.getLong(GENERATOR_DELAY, 7L) * 3600L * 24L * 1000L;
-      long time = job.getLong(Nutch.GENERATE_TIME_KEY, 0L);
-      if (time > 0)
-        genTime.set(time);
-      schedule = FetchScheduleFactory.getFetchSchedule(job);
-      scoreThreshold = job.getFloat(GENERATOR_MIN_SCORE, Float.NaN);
-      intervalThreshold = job.getInt(GENERATOR_MIN_INTERVAL, -1);
-      restrictStatus = job.get(GENERATOR_RESTRICT_STATUS, null);
-      expr = JexlUtil.parseExpression(job.get(GENERATOR_EXPR, null));
-      maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
-      segCounts = new int[maxNumSegments];
-      if (job.get(GENERATOR_HOSTDB) != null) {
-        try {
-          Path path = new Path(job.get(GENERATOR_HOSTDB), "current");
-          hostdbReaders = SequenceFileOutputFormat.getReaders(job, path);
-          maxCountExpr = JexlUtil.parseExpression(job.get(GENERATOR_MAX_COUNT_EXPR, null));
-          fetchDelayExpr = JexlUtil.parseExpression(job.get(GENERATOR_FETCH_DELAY_EXPR, null));
-        } catch (IOException e) {
-          LOG.error("Error reading HostDB because {}", e.getMessage());
-        }
-      }
-    }
+  public static class Selector extends
+      Partitioner<FloatWritable, Writable> {
+    private static LongWritable genTime = new LongWritable(System.currentTimeMillis());
+    private static long curTime;
+    private static long limit;
+    private static int segCounts[];
+    private static int maxCount;
+    private static boolean byDomain = false;
+    private static URLFilters filters;
+    private static URLNormalizers normalizers;
+    private static ScoringFilters scfilters;
+    private static SelectorEntry entry = new SelectorEntry();
+    private static FloatWritable sortValue = new FloatWritable();
+    private static boolean filter;
+    private static boolean normalise;
+    private static long genDelay;
+    private static FetchSchedule schedule;
+    private static float scoreThreshold = 0f;
+    private static int intervalThreshold = -1;
+    private static String restrictStatus = null;
+    private static int maxNumSegments = 1;
+    private static Expression expr = null;
+    private static MapFile.Reader[] hostdbReaders = null;
+    private static Expression maxCountExpr = null;
+    private static Expression fetchDelayExpr = null;
+    private static Configuration config;
+

          Review comment:
          @sebastian-nagel I will look at the usage of each variable and declare each one in its respective mapper or reducer class as an instance variable, so none of them need to be static. Does that sound good?
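For completeness, the driver side of such a migration changes in the same way: JobConf/JobClient from org.apache.hadoop.mapred are replaced by the mapreduce API's Job. The following is a minimal sketch only, assuming the proposed SelectorMapper/SelectorPartitioner/SelectorReducer split and illustrative path variables (crawlDbDir, tempDir); it is not the actual patch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Old API: JobConf job = new NutchJob(getConf()); ... JobClient.runJob(job);
// New API (inside the tool's run method, sketch):
Configuration conf = getConf();
Job job = Job.getInstance(conf, "generate: select");
job.setJarByClass(Generator.class);
FileInputFormat.addInputPath(job, new Path(crawlDbDir, "current"));
FileOutputFormat.setOutputPath(job, tempDir);
job.setMapperClass(SelectorMapper.class);           // proposed split classes,
job.setPartitionerClass(SelectorPartitioner.class); // not the old combined
job.setReducerClass(SelectorReducer.class);         // Selector
job.setMapOutputKeyClass(FloatWritable.class);
job.setMapOutputValueClass(SelectorEntry.class);
// waitForCompletion replaces JobClient.runJob and reports success/failure
if (!job.waitForCompletion(true)) {
  throw new RuntimeException("Generator job did not succeed");
}
```

Settings such as GENERATOR_TOP_N would be placed on `conf` before `Job.getInstance`, so each task can read them back in its own `setup()`.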


          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138950525

          ##########
          File path: src/java/org/apache/nutch/crawl/Generator.java
          ##########
          @@ -114,81 +122,34 @@ public String toString() {
          }

          /** Selects entries due for fetch. */

          - public static class Selector implements
          -     Mapper<Text, CrawlDatum, FloatWritable, SelectorEntry>,
          -     Partitioner<FloatWritable, Writable>,
          -     Reducer<FloatWritable, SelectorEntry, FloatWritable, SelectorEntry> {
          -   private LongWritable genTime = new LongWritable(System.currentTimeMillis());
          -   private long curTime;
          -   private long limit;
          -   private long count;
          -   private HashMap<String, int[]> hostCounts = new HashMap<>();
          -   private int segCounts[];
          -   private int maxCount;
          -   private boolean byDomain = false;
          -   private Partitioner<Text, Writable> partitioner = new URLPartitioner();
          -   private URLFilters filters;
          -   private URLNormalizers normalizers;
          -   private ScoringFilters scfilters;
          -   private SelectorEntry entry = new SelectorEntry();
          -   private FloatWritable sortValue = new FloatWritable();
          -   private boolean filter;
          -   private boolean normalise;
          -   private long genDelay;
          -   private FetchSchedule schedule;
          -   private float scoreThreshold = 0f;
          -   private int intervalThreshold = -1;
          -   private String restrictStatus = null;
          -   private int maxNumSegments = 1;
          -   private Expression expr = null;
          -   private int currentsegmentnum = 1;
          -   private SequenceFile.Reader[] hostdbReaders = null;
          -   private Expression maxCountExpr = null;
          -   private Expression fetchDelayExpr = null;
          -   private JobConf conf = null;
          -
          -   public void configure(JobConf job) {
          -     this.conf = job;
          -     curTime = job.getLong(GENERATOR_CUR_TIME, System.currentTimeMillis());
          -     limit = job.getLong(GENERATOR_TOP_N, Long.MAX_VALUE)
          -         / job.getNumReduceTasks();
          -     maxCount = job.getInt(GENERATOR_MAX_COUNT, -1);
          -     if (maxCount == -1) {
          -       byDomain = false;
          -     }
          -     if (GENERATOR_COUNT_VALUE_DOMAIN.equals(job.get(GENERATOR_COUNT_MODE)))
          -       byDomain = true;
          -     filters = new URLFilters(job);
          -     normalise = job.getBoolean(GENERATOR_NORMALISE, true);
          -     if (normalise)
          -       normalizers = new URLNormalizers(job,
          -           URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
          -     scfilters = new ScoringFilters(job);
          -     partitioner.configure(job);
          -     filter = job.getBoolean(GENERATOR_FILTER, true);
          -     genDelay = job.getLong(GENERATOR_DELAY, 7L) * 3600L * 24L * 1000L;
          -     long time = job.getLong(Nutch.GENERATE_TIME_KEY, 0L);
          -     if (time > 0)
          -       genTime.set(time);
          -     schedule = FetchScheduleFactory.getFetchSchedule(job);
          -     scoreThreshold = job.getFloat(GENERATOR_MIN_SCORE, Float.NaN);
          -     intervalThreshold = job.getInt(GENERATOR_MIN_INTERVAL, -1);
          -     restrictStatus = job.get(GENERATOR_RESTRICT_STATUS, null);
          -     expr = JexlUtil.parseExpression(job.get(GENERATOR_EXPR, null));
          -     maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
          -     segCounts = new int[maxNumSegments];
          -     if (job.get(GENERATOR_HOSTDB) != null) {
          -       try {
          -         Path path = new Path(job.get(GENERATOR_HOSTDB), "current");
          -         hostdbReaders = SequenceFileOutputFormat.getReaders(job, path);
          -         maxCountExpr = JexlUtil.parseExpression(job.get(GENERATOR_MAX_COUNT_EXPR, null));
          -         fetchDelayExpr = JexlUtil.parseExpression(job.get(GENERATOR_FETCH_DELAY_EXPR, null));
          -       } catch (IOException e) {
          -         LOG.error("Error reading HostDB because {}", e.getMessage());
          -       }
          -     }
          -   }
          + public static class Selector extends
          +     Partitioner<FloatWritable, Writable> {
          +   private static LongWritable genTime = new LongWritable(System.currentTimeMillis());
          +   private static long curTime;
          +   private static long limit;
          +   private static int segCounts[];
          +   private static int maxCount;
          +   private static boolean byDomain = false;
          +   private static URLFilters filters;
          +   private static URLNormalizers normalizers;
          +   private static ScoringFilters scfilters;
          +   private static SelectorEntry entry = new SelectorEntry();
          +   private static FloatWritable sortValue = new FloatWritable();
          +   private static boolean filter;
          +   private static boolean normalise;
          +   private static long genDelay;
          +   private static FetchSchedule schedule;
          +   private static float scoreThreshold = 0f;
          +   private static int intervalThreshold = -1;
          +   private static String restrictStatus = null;
          +   private static int maxNumSegments = 1;
          +   private static Expression expr = null;
          +   private static MapFile.Reader[] hostdbReaders = null;
          +   private static Expression maxCountExpr = null;
          +   private static Expression fetchDelayExpr = null;
          +   private static Configuration config;
          +

          Review comment:
          Yes, that seems clear and safe. Of course, don't forget to set the values from the configuration.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          Hide
          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r138957904

          ##########
          File path: src/java/org/apache/nutch/crawl/Generator.java
          ##########
          @@ -114,81 +122,34 @@ public String toString() {
          }

          /** Selects entries due for fetch. */

          - public static class Selector implements
          -     Mapper<Text, CrawlDatum, FloatWritable, SelectorEntry>,
          -     Partitioner<FloatWritable, Writable>,
          -     Reducer<FloatWritable, SelectorEntry, FloatWritable, SelectorEntry> {
          -   private LongWritable genTime = new LongWritable(System.currentTimeMillis());
          -   private long curTime;
          -   private long limit;
          -   private long count;
          -   private HashMap<String, int[]> hostCounts = new HashMap<>();
          -   private int segCounts[];
          -   private int maxCount;
          -   private boolean byDomain = false;
          -   private Partitioner<Text, Writable> partitioner = new URLPartitioner();
          -   private URLFilters filters;
          -   private URLNormalizers normalizers;
          -   private ScoringFilters scfilters;
          -   private SelectorEntry entry = new SelectorEntry();
          -   private FloatWritable sortValue = new FloatWritable();
          -   private boolean filter;
          -   private boolean normalise;
          -   private long genDelay;
          -   private FetchSchedule schedule;
          -   private float scoreThreshold = 0f;
          -   private int intervalThreshold = -1;
          -   private String restrictStatus = null;
          -   private int maxNumSegments = 1;
          -   private Expression expr = null;
          -   private int currentsegmentnum = 1;
          -   private SequenceFile.Reader[] hostdbReaders = null;
          -   private Expression maxCountExpr = null;
          -   private Expression fetchDelayExpr = null;
          -   private JobConf conf = null;
          -
          -   public void configure(JobConf job) {
          -     this.conf = job;
          -     curTime = job.getLong(GENERATOR_CUR_TIME, System.currentTimeMillis());
          -     limit = job.getLong(GENERATOR_TOP_N, Long.MAX_VALUE)
          -         / job.getNumReduceTasks();
          -     maxCount = job.getInt(GENERATOR_MAX_COUNT, -1);
          -     if (maxCount == -1) {
          -       byDomain = false;
          -     }
          -     if (GENERATOR_COUNT_VALUE_DOMAIN.equals(job.get(GENERATOR_COUNT_MODE)))
          -       byDomain = true;
          -     filters = new URLFilters(job);
          -     normalise = job.getBoolean(GENERATOR_NORMALISE, true);
          -     if (normalise)
          -       normalizers = new URLNormalizers(job,
          -           URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
          -     scfilters = new ScoringFilters(job);
          -     partitioner.configure(job);
          -     filter = job.getBoolean(GENERATOR_FILTER, true);
          -     genDelay = job.getLong(GENERATOR_DELAY, 7L) * 3600L * 24L * 1000L;
          -     long time = job.getLong(Nutch.GENERATE_TIME_KEY, 0L);
          -     if (time > 0)
          -       genTime.set(time);
          -     schedule = FetchScheduleFactory.getFetchSchedule(job);
          -     scoreThreshold = job.getFloat(GENERATOR_MIN_SCORE, Float.NaN);
          -     intervalThreshold = job.getInt(GENERATOR_MIN_INTERVAL, -1);
          -     restrictStatus = job.get(GENERATOR_RESTRICT_STATUS, null);
          -     expr = JexlUtil.parseExpression(job.get(GENERATOR_EXPR, null));
          -     maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
          -     segCounts = new int[maxNumSegments];
          -     if (job.get(GENERATOR_HOSTDB) != null) {
          -       try {
          -         Path path = new Path(job.get(GENERATOR_HOSTDB), "current");
          -         hostdbReaders = SequenceFileOutputFormat.getReaders(job, path);
          -         maxCountExpr = JexlUtil.parseExpression(job.get(GENERATOR_MAX_COUNT_EXPR, null));
          -         fetchDelayExpr = JexlUtil.parseExpression(job.get(GENERATOR_FETCH_DELAY_EXPR, null));
          -       } catch (IOException e) {
          -         LOG.error("Error reading HostDB because {}", e.getMessage());
          -       }
          -     }
          -   }
          + public static class Selector extends
          +     Partitioner<FloatWritable, Writable> {
          +   private static LongWritable genTime = new LongWritable(System.currentTimeMillis());
          +   private static long curTime;
          +   private static long limit;
          +   private static int segCounts[];
          +   private static int maxCount;
          +   private static boolean byDomain = false;
          +   private static URLFilters filters;
          +   private static URLNormalizers normalizers;
          +   private static ScoringFilters scfilters;
          +   private static SelectorEntry entry = new SelectorEntry();
          +   private static FloatWritable sortValue = new FloatWritable();
          +   private static boolean filter;
          +   private static boolean normalise;
          +   private static long genDelay;
          +   private static FetchSchedule schedule;
          +   private static float scoreThreshold = 0f;
          +   private static int intervalThreshold = -1;
          +   private static String restrictStatus = null;
          +   private static int maxNumSegments = 1;
          +   private static Expression expr = null;
          +   private static MapFile.Reader[] hostdbReaders = null;
          +   private static Expression maxCountExpr = null;
          +   private static Expression fetchDelayExpr = null;
          +   private static Configuration config;
          +

          Review comment:
          >>Sharing configuration values between mapper and reducer via static variables does not work in distributed mode because mapper and reducer running each in its own JavaVM

          @sebastian-nagel regarding this, if, in distributed mode, the mapper and reducer each run in their own JVM, how do they share the context object?

          >>Of course, not to forget about setting the values from the configuration.

          And regarding this: we set values in the configuration object before running the job itself, and they are later retrieved by the mapper and reducer, right? Moreover, if the mapper or reducer sets any value in the configuration object, I will make sure that happens in setup(). Please feel free to correct me if I am missing something.
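          The flow described above — the driver fills the configuration before job submission, and each task reads it once in setup() — can be sketched without Hadoop at all. In the sketch below, java.util.Properties is a stand-in for Hadoop's Configuration and TaskSim stands in for a mapper or reducer task; the class and property names are illustrative, not Nutch or Hadoop API.

          ```java
          import java.util.Properties;

          // Stand-in for a mapper/reducer task: values arrive via the configuration
          // object passed to setup(), and land in instance fields, never statics.
          class TaskSim {
            private long curTime; // instance field, initialized once in setup()

            void setup(Properties conf) {
              curTime = Long.parseLong(conf.getProperty("generate.curTime", "0"));
            }

            long getCurTime() { return curTime; }
          }

          public class ConfigFlowDemo {
            public static void main(String[] args) {
              // "Driver" side: set job-scoped values before the job starts.
              Properties conf = new Properties();
              conf.setProperty("generate.curTime", "42");

              // Separate task instances, as if each ran in its own JVM; both see
              // the same values because both receive the configuration in setup().
              TaskSim mapper = new TaskSim();
              TaskSim reducer = new TaskSim();
              mapper.setup(conf);
              reducer.setup(conf);
              System.out.println(mapper.getCurTime() + " " + reducer.getCurTime());
            }
          }
          ```

          The point of the stand-in: the only channel from driver to tasks is the configuration object itself, which is why values written by one task's setup() are invisible to the others.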


          Hide
          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#discussion_r139002963

          ##########
          File path: src/java/org/apache/nutch/crawl/Generator.java
          ##########
          @@ -114,81 +122,34 @@ public String toString() {
          }

          /** Selects entries due for fetch. */

          - public static class Selector implements
          -     Mapper<Text, CrawlDatum, FloatWritable, SelectorEntry>,
          -     Partitioner<FloatWritable, Writable>,
          -     Reducer<FloatWritable, SelectorEntry, FloatWritable, SelectorEntry> {
          -   private LongWritable genTime = new LongWritable(System.currentTimeMillis());
          -   private long curTime;
          -   private long limit;
          -   private long count;
          -   private HashMap<String, int[]> hostCounts = new HashMap<>();
          -   private int segCounts[];
          -   private int maxCount;
          -   private boolean byDomain = false;
          -   private Partitioner<Text, Writable> partitioner = new URLPartitioner();
          -   private URLFilters filters;
          -   private URLNormalizers normalizers;
          -   private ScoringFilters scfilters;
          -   private SelectorEntry entry = new SelectorEntry();
          -   private FloatWritable sortValue = new FloatWritable();
          -   private boolean filter;
          -   private boolean normalise;
          -   private long genDelay;
          -   private FetchSchedule schedule;
          -   private float scoreThreshold = 0f;
          -   private int intervalThreshold = -1;
          -   private String restrictStatus = null;
          -   private int maxNumSegments = 1;
          -   private Expression expr = null;
          -   private int currentsegmentnum = 1;
          -   private SequenceFile.Reader[] hostdbReaders = null;
          -   private Expression maxCountExpr = null;
          -   private Expression fetchDelayExpr = null;
          -   private JobConf conf = null;
          -
          -   public void configure(JobConf job) {
          -     this.conf = job;
          -     curTime = job.getLong(GENERATOR_CUR_TIME, System.currentTimeMillis());
          -     limit = job.getLong(GENERATOR_TOP_N, Long.MAX_VALUE)
          -         / job.getNumReduceTasks();
          -     maxCount = job.getInt(GENERATOR_MAX_COUNT, -1);
          -     if (maxCount == -1) {
          -       byDomain = false;
          -     }
          -     if (GENERATOR_COUNT_VALUE_DOMAIN.equals(job.get(GENERATOR_COUNT_MODE)))
          -       byDomain = true;
          -     filters = new URLFilters(job);
          -     normalise = job.getBoolean(GENERATOR_NORMALISE, true);
          -     if (normalise)
          -       normalizers = new URLNormalizers(job,
          -           URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
          -     scfilters = new ScoringFilters(job);
          -     partitioner.configure(job);
          -     filter = job.getBoolean(GENERATOR_FILTER, true);
          -     genDelay = job.getLong(GENERATOR_DELAY, 7L) * 3600L * 24L * 1000L;
          -     long time = job.getLong(Nutch.GENERATE_TIME_KEY, 0L);
          -     if (time > 0)
          -       genTime.set(time);
          -     schedule = FetchScheduleFactory.getFetchSchedule(job);
          -     scoreThreshold = job.getFloat(GENERATOR_MIN_SCORE, Float.NaN);
          -     intervalThreshold = job.getInt(GENERATOR_MIN_INTERVAL, -1);
          -     restrictStatus = job.get(GENERATOR_RESTRICT_STATUS, null);
          -     expr = JexlUtil.parseExpression(job.get(GENERATOR_EXPR, null));
          -     maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
          -     segCounts = new int[maxNumSegments];
          -     if (job.get(GENERATOR_HOSTDB) != null) {
          -       try {
          -         Path path = new Path(job.get(GENERATOR_HOSTDB), "current");
          -         hostdbReaders = SequenceFileOutputFormat.getReaders(job, path);
          -         maxCountExpr = JexlUtil.parseExpression(job.get(GENERATOR_MAX_COUNT_EXPR, null));
          -         fetchDelayExpr = JexlUtil.parseExpression(job.get(GENERATOR_FETCH_DELAY_EXPR, null));
          -       } catch (IOException e) {
          -         LOG.error("Error reading HostDB because {}", e.getMessage());
          -       }
          -     }
          -   }
          + public static class Selector extends
          +     Partitioner<FloatWritable, Writable> {
          +   private static LongWritable genTime = new LongWritable(System.currentTimeMillis());
          +   private static long curTime;
          +   private static long limit;
          +   private static int segCounts[];
          +   private static int maxCount;
          +   private static boolean byDomain = false;
          +   private static URLFilters filters;
          +   private static URLNormalizers normalizers;
          +   private static ScoringFilters scfilters;
          +   private static SelectorEntry entry = new SelectorEntry();
          +   private static FloatWritable sortValue = new FloatWritable();
          +   private static boolean filter;
          +   private static boolean normalise;
          +   private static long genDelay;
          +   private static FetchSchedule schedule;
          +   private static float scoreThreshold = 0f;
          +   private static int intervalThreshold = -1;
          +   private static String restrictStatus = null;
          +   private static int maxNumSegments = 1;
          +   private static Expression expr = null;
          +   private static MapFile.Reader[] hostdbReaders = null;
          +   private static Expression maxCountExpr = null;
          +   private static Expression fetchDelayExpr = null;
          +   private static Configuration config;
          +

          Review comment:
          Strictly speaking, it's a different object but it represents the "same" "job context" - configuration, counters, output location, etc.

          You can set properties in the configuration object in the run() method of a job. The changed configuration is then passed via setup(Context) to the mapper and reducer objects. But it does not work the other way round: if the configuration is modified in the setup() method, this has no effect on other mapper or reducer instances. The setup(Context) method is called only once, at the beginning of a task, which makes it the right place to initialize variables from the configuration and to create and configure required objects and plugins.
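          The lifecycle described here — setup() runs once at the start of a task, then map() runs once per record — can be illustrated with a small stand-in. MiniMapper and CountingMapper below are hypothetical names mimicking the run() loop of org.apache.hadoop.mapreduce.Mapper; they are not actual Hadoop classes.

          ```java
          import java.util.List;

          // Stand-in for the mapreduce task lifecycle: run() invokes setup() exactly
          // once, then map() for every input record.
          abstract class MiniMapper<T> {
            int setupCalls = 0;

            void setup() { setupCalls++; } // once per task, before any map() call

            abstract void map(T record);

            void run(List<T> records) {
              setup();                       // initialize from configuration here
              for (T r : records) map(r);    // then process each record
            }
          }

          class CountingMapper extends MiniMapper<String> {
            int mapped = 0;
            @Override void map(String record) { mapped++; }
          }

          public class LifecycleDemo {
            public static void main(String[] args) {
              CountingMapper m = new CountingMapper();
              m.run(List.of("a", "b", "c"));
              System.out.println(m.setupCalls + " " + m.mapped); // prints "1 3"
            }
          }
          ```

          Because setup() runs exactly once per task attempt, any heavyweight initialization (filters, normalizers, plugin loading) done there is amortized over all records of the task, which is why it is the natural replacement for the old configure(JobConf) method.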


          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-330280911

          @sebastian-nagel @lewismc please review my latest commit. Thanks.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-330283442

          @Omkar20895 thanks, did you update the pull request to remove all instances where wildcards are used for imports?

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-330792623

          Hi @lewismc I have opened a separate issue for this[0] and will update that separately since this patch is already a very large patch.

          [0] https://issues.apache.org/jira/browse/NUTCH-2427

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-330794187

          @sebastian-nagel can you please test this patch on a Hadoop cluster and let us know the results? Thanks in advance.

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-330795100

          I'll do this for sure, but it may still take a couple of days. If you have time, please test further components (linkdb, indexer Solr/ElasticSearch, webgraph) in local mode, or try pseudo-distributed mode (see https://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial). Sorting out issues earlier will speed everything up. Thanks!

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-332860192

          @sebastian-nagel Pardon me for not getting to this sooner; did you get a chance to test this in distributed mode? Thanks.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-332929185

          @Omkar20895 you also should be testing this in [pseudo-distributed mode](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation), if you have any issues then let me know offline

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-332941883

          Hi @Omkar20895,

          running a test crawl on a Hadoop cluster failed again. I got two ClassNotFoundExceptions: the first in Generator, in the mapper of the last step, "partitioning":

          ```
          17/09/28 17:38:16 INFO crawl.Generator: Generator: Partitioning selected urls for politeness.

          ...

          17/09/28 17:54:56 INFO mapreduce.Job: Task Id : attempt_1505293155476_0250_m_000000_98, Status : FAILED
          Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.nutch.crawl.Generator$SelectorInverseMapper not found
          at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2203)
          at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:196)
          at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:751)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
          at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:422)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
          at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
          Caused by: java.lang.ClassNotFoundException: Class org.apache.nutch.crawl.Generator$SelectorInverseMapper not found
          at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2109)
          at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2201)
          ... 8 more

          17/09/28 17:55:07 INFO mapreduce.Job: map 100% reduce 100%
          17/09/28 17:55:07 INFO mapreduce.Job: Job job_1505293155476_0250 failed with state FAILED due to: Task failed task_1505293155476_0250_m_000000
          Job failed as tasks failed. failedMaps:1 failedReduces:0

          17/09/28 17:55:07 INFO mapreduce.Job: Counters: 9
          Job Counters
          Failed map tasks=100
          Killed reduce tasks=2
          Launched map tasks=100
          Other local map tasks=100
          Total time spent by all maps in occupied slots (ms)=2073900
          Total time spent by all reduces in occupied slots (ms)=0
          Total time spent by all map tasks (ms)=691300
          Total vcore-milliseconds taken by all map tasks=691300
          Total megabyte-milliseconds taken by all map tasks=2123673600
          17/09/28 17:55:07 INFO crawl.Generator: Generator: finished at 2017-09-28 17:55:07, elapsed: 00:28:26
          ```

          Second, in the mapper of the updatedb tool:

          ```
          17/09/28 18:14:22 INFO crawl.CrawlDb: CrawlDb update: starting at 2017-09-28 18:14:22

          ...

          17/09/28 18:29:06 INFO mapreduce.Job: Task Id : attempt_1505293155476_0253_m_000003_98, Status : FAILED
          Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.nutch.crawl.CrawlDbFilter not found
          at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2203)
          at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:196)
          at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:751)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
          at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:422)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
          at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
          Caused by: java.lang.ClassNotFoundException: Class org.apache.nutch.crawl.CrawlDbFilter not found
          at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2109)
          at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2201)
          ... 8 more

          17/09/28 18:29:07 INFO mapreduce.Job: map 33% reduce 100%
          17/09/28 18:29:08 INFO mapreduce.Job: map 100% reduce 100%
          17/09/28 18:29:08 INFO mapreduce.Job: Job job_1505293155476_0253 failed with state FAILED due to: Task failed task_1505293155476_0253_m_000001
          Job failed as tasks failed. failedMaps:1 failedReduces:0

          ...

          17/09/28 18:29:08 INFO crawl.CrawlDb: CrawlDb update: finished at 2017-09-28 18:29:08, elapsed: 00:14:45
          ```

          I have no clue why; the classes are in the job file. It could be because of some incompatibility when running on Cloudera CDH 5.12.1 (Hadoop 2.6). I'll try to investigate this problem.

          Meanwhile, please have a look at another issue this uncovered: although a job failed (note: a MapReduce job can be just one of multiple steps), the generate or updatedb "job" (here: one run of a tool) signaled "success" and the crawl script just continued as if there wasn't any problem. Please always check the return value of job.waitForCompletion(...), and if it returns false:

          • perform the necessary cleanups: delete temporary data, etc.
          • make the main routine return 1
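          The two points above can be sketched as follows. This is illustrative, not actual Nutch code: runJob() is a stand-in for Hadoop's job.waitForCompletion(true), which needs a real or local job runner, and the class and temp-file names are made up.

          ```java
          import java.io.IOException;
          import java.nio.file.Files;
          import java.nio.file.Path;

          public class JobStatusSketch {
              // Stand-in for job.waitForCompletion(true); returns whether the job succeeded.
              static boolean runJob(boolean succeed) { return succeed; }

              // The pattern requested: on failure, clean up temporary data and return 1
              // so the calling crawl script stops instead of continuing past a failed step.
              static int runTool(Path tempDir, boolean jobSucceeds) throws IOException {
                  boolean success = runJob(jobSucceeds);
                  if (!success) {
                      Files.deleteIfExists(tempDir); // delete temporary data
                      return 1;                      // non-zero exit: failure is visible
                  }
                  return 0;
              }

              public static void main(String[] args) throws IOException {
                  Path tmp = Files.createTempFile("nutch-temp-", ".seg");
                  int rc = runTool(tmp, false); // simulate a failed MapReduce step
                  System.out.println(rc);
              }
          }
          ```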

          I can only second Lewis: please try to run tests independently in local and pseudo-distributed mode. One iteration (commit/PR, test, analyze and report error) takes too long otherwise. Thanks!

          githubbot ASF GitHub Bot added a comment -

          Omkar20895 commented on issue #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
          URL: https://github.com/apache/nutch/pull/221#issuecomment-333180101

          @lewismc @sebastian-nagel you are right, I will start testing it in pseudo-distributed mode first. Thanks.


            People

            • Assignee: Unassigned
            • Reporter: omkar20895 Omkar Reddy
            • Votes: 0
            • Watchers: 2

              Dates

              • Created:
                Updated:

                Development