  1. Nutch
  2. NUTCH-2375

Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: deployment
    • Labels: None

      Description

      Nutch still uses the deprecated org.apache.hadoop.mapred API. It needs to be updated to the org.apache.hadoop.mapreduce API.

        Activity

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 opened a new pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188

        • This PR is a part of the upgrade and will be updated continuously by me.
        • Please feel free to review the PR.

        ----------------------------------------------------------------
        This is an automated message from the Apache Git Service.
        To respond to the message, please log on GitHub and use the
        URL above to go to the specific comment.

        For queries about this service, please contact Infrastructure at:
        users@infra.apache.org

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-296388783

        Excellent @Omkar20895, thank you for starting this pull request. You will now see that EVERY class is broken, i.e. does not compile. Let's fix those classes gradually. Please update this PR as you progress. Thank you for submitting the PR early; it makes a huge difference for review.

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-297560916

        @Omkar20895 please keep the updates coming as we can review the code incrementally. Thanks.

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-297653471

        @lewismc I am working on upgrading crawldb and will send the update as soon as I am done with it. Thanks.

        omkar20895 Omkar Reddy added a comment -

        Hello dev@,

        I am using the slides at https://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api as a guide for upgrading the codebase. Please post on this thread if you find any discrepancy in them.

        Thanks,
        Omkar.

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-298213968

        @lewismc CrawlDb needs further updates, which will follow in the coming weeks. Major changes are still needed in DeduplicationJob.java. Thanks.

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-300236434

        @lewismc please review the latest commit so that I can incorporate the changes suggested by you.

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r116651511

        ##########
        File path: src/java/org/apache/nutch/crawl/CrawlDbReader.java
        ##########
        @@ -47,22 +47,10 @@
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.Writable;
        -import org.apache.hadoop.mapred.FileInputFormat;
        -import org.apache.hadoop.mapred.FileOutputFormat;
        -import org.apache.hadoop.mapred.JobClient;
        -import org.apache.hadoop.mapred.JobConf;
        -import org.apache.hadoop.mapred.MapFileOutputFormat;
        -import org.apache.hadoop.mapred.Mapper;
        -import org.apache.hadoop.mapred.OutputCollector;
        -import org.apache.hadoop.mapred.RecordWriter;
        -import org.apache.hadoop.mapred.Reducer;
        -import org.apache.hadoop.mapred.Reporter;
        -import org.apache.hadoop.mapred.SequenceFileInputFormat;
        -import org.apache.hadoop.mapred.SequenceFileOutputFormat;
        -import org.apache.hadoop.mapred.TextOutputFormat;
        -import org.apache.hadoop.mapred.lib.HashPartitioner;
        -import org.apache.hadoop.mapred.lib.IdentityMapper;
        -import org.apache.hadoop.mapred.lib.IdentityReducer;
        +import org.apache.hadoop.mapreduce.*;

        Review comment:
        Same here.

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r116651544

        ##########
        File path: src/java/org/apache/nutch/crawl/CrawlDbReader.java
        ##########
        @@ -371,26 +361,27 @@ public void close() {
        private TreeMap<String, LongWritable> processStatJobHelper(String crawlDb, Configuration config, boolean sort) throws IOException{
        Path tmpFolder = new Path(crawlDb, "stat_tmp" + System.currentTimeMillis());

        -    JobConf job = new NutchJob(config);
        +    Job job = new NutchJob(config);
        +    config = job.getConfiguration();

        Review comment:
        Formatting.
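Apart from the formatting remark, the added `config = job.getConfiguration();` line in this diff is worth a note. As documented for Hadoop's `org.apache.hadoop.mapreduce.Job`, the job works on a copy of the `Configuration` it is given, so settings applied to the original object after the job is created do not reach the job. The sketch below models only that copy behaviour with plain `java.util.Properties`, so it runs without Hadoop on the classpath. `CopyingJobSketch`, `ConfigCopyDemo`, and the `example.*` keys are hypothetical names used for illustration; they are not part of Nutch or Hadoop.

```java
import java.util.Properties;

// Stand-in for a job that copies its configuration on construction.
// This is NOT the Hadoop Job class; it only models the copy semantics.
class CopyingJobSketch {
  private final Properties conf = new Properties();

  CopyingJobSketch(Properties original) {
    // Defensive copy: later changes to 'original' are not seen by the job.
    conf.putAll(original);
  }

  Properties getConfiguration() {
    return conf;
  }
}

class ConfigCopyDemo {
  public static void main(String[] args) {
    Properties config = new Properties();
    config.setProperty("example.sort", "false"); // hypothetical key

    CopyingJobSketch job = new CopyingJobSketch(config);

    // Set on the original after the job exists: the job does not see it.
    config.setProperty("example.late", "ignored");
    System.out.println(job.getConfiguration().getProperty("example.late"));

    // Same move as in the diff: continue with the job's own configuration.
    config = job.getConfiguration();
    config.setProperty("example.sort", "true");
    System.out.println(job.getConfiguration().getProperty("example.sort"));
  }
}
```

Reassigning the local variable, as the diff does, keeps the rest of the method unchanged while making sure later settings land on the configuration the job actually uses.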

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r116651333

        ##########
        File path: src/java/org/apache/nutch/crawl/CrawlDb.java
        ##########
        @@ -28,8 +28,9 @@
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.fs.*;
        import org.apache.hadoop.conf.*;
        -import org.apache.hadoop.mapred.*;
        import org.apache.hadoop.mapreduce.Job;
        +import org.apache.hadoop.mapreduce.lib.input.*;

        Review comment:
        Please never use wildcard imports, e.g. ```import org.apache.hadoop.mapreduce.lib.input.*;```. Always import each class individually.
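One practical reason for this rule during the migration is that `org.apache.hadoop.mapred` and `org.apache.hadoop.mapreduce` both define types with the same simple names (`Mapper` and `Reducer`, for example), so wildcard imports of both packages make those names ambiguous. The same effect can be reproduced with the JDK alone, which is what the minimal sketch below does. `ExplicitImportSketch` and `nonBlank` are hypothetical names chosen for this illustration.

```java
// Explicit single-type imports: each simple name maps to exactly one class.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class, used only for this illustration.
class ExplicitImportSketch {

  // Returns the non-blank entries of the input, trimmed, in original order.
  static List<String> nonBlank(List<String> lines) {
    // With "import java.util.*;" plus "import java.awt.*;" the name List
    // would be ambiguous (java.util.List vs java.awt.List) and this file
    // would not compile. The explicit imports above avoid that.
    List<String> result = new ArrayList<>();
    for (String line : lines) {
      String trimmed = line.trim();
      if (!trimmed.isEmpty()) {
        result.add(trimmed);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // prints [crawldb, segments]
    System.out.println(nonBlank(Arrays.asList(" crawldb ", "", "segments")));
  }
}
```

Explicit imports also make a diff like the one above easier to review, because every removed `mapred` class has a visible `mapreduce` replacement.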

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r116651673

        ##########
        File path: src/java/org/apache/nutch/crawl/Generator.java
        ##########
        @@ -29,8 +29,10 @@
        import org.apache.commons.jexl2.Expression;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.conf.*;
        -import org.apache.hadoop.mapred.*;
        -import org.apache.hadoop.mapred.lib.MultipleSequenceFileOutputFormat;
        +import org.apache.hadoop.mapreduce.*;

        Review comment:
        Same here.

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r116651780

        ##########
        File path: src/java/org/apache/nutch/util/NutchJob.java
        ##########
        @@ -17,14 +17,16 @@

        package org.apache.nutch.util;

        +import java.io.IOException;
        +
        import org.apache.hadoop.conf.Configuration;
        -import org.apache.hadoop.mapred.JobConf;
        +import org.apache.hadoop.mapreduce.Job;

        -/** A {@link JobConf} for Nutch jobs. */
        -public class NutchJob extends JobConf {
        +/** A {@link Job} for Nutch jobs. */
        +public class NutchJob extends Job {

        -  public NutchJob(Configuration conf) {
        -    super(conf, NutchJob.class);
        +  public NutchJob(Configuration conf) throws IOException {
        +    super(conf, "NutchJob");

        Review comment:
        Does this mean that every Job will be named ```"NutchJob"```?

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r116651456

        ##########
        File path: src/java/org/apache/nutch/crawl/CrawlDbMerger.java
        ##########
        @@ -32,7 +32,9 @@
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.Writable;
        -import org.apache.hadoop.mapred.*;
        +import org.apache.hadoop.mapreduce.*;

        Review comment:
        Same here.

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r116651646

        ##########
        File path: src/java/org/apache/nutch/crawl/DeduplicationJob.java
        ##########
        @@ -31,7 +31,7 @@
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.Text;
        -import org.apache.hadoop.mapred.Counters.Group;
        +/*import org.apache.hadoop.mapred.Counters.Group;

        Review comment:
        Please never leave code commented out like this. Also, please use explicit individual imports.

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r116693140

        ##########
        File path: src/java/org/apache/nutch/util/NutchJob.java
        ##########
        @@ -17,14 +17,16 @@

        package org.apache.nutch.util;

        +import java.io.IOException;
        +
        import org.apache.hadoop.conf.Configuration;
        -import org.apache.hadoop.mapred.JobConf;
        +import org.apache.hadoop.mapreduce.Job;

        -/** A {@link JobConf} for Nutch jobs. */
        -public class NutchJob extends JobConf {
        +/** A {@link Job} for Nutch jobs. */
        +public class NutchJob extends Job {

        -  public NutchJob(Configuration conf) {
        -    super(conf, NutchJob.class);
        +  public NutchJob(Configuration conf) throws IOException {
        +    super(conf, "NutchJob");


        Review comment:
        Yes, @lewismc. If we create a job as NutchJob job = new NutchJob(conf); it will be named NutchJob by default, and the name can be overridden with job.setJobName("<name>"); in the code where the job is created. Should it be done some other way?
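The default-name-then-override behaviour described in this reply can be sketched without Hadoop on the classpath. In the block below, `SketchJob` is a hypothetical stand-in for `org.apache.hadoop.mapreduce.Job`, reduced to `getJobName` and `setJobName`, and `SketchNutchJob` is a hypothetical analogue of the `NutchJob` constructor in the diff. The second constructor, which takes the name up front, is an assumption added here as one possible answer to the closing question; it is not taken from the pull request.

```java
// Stand-in for org.apache.hadoop.mapreduce.Job, reduced to name handling.
// This is NOT the Hadoop class; it only models getJobName/setJobName.
class SketchJob {
  private String jobName;

  SketchJob(String jobName) {
    this.jobName = jobName;
  }

  String getJobName() {
    return jobName;
  }

  void setJobName(String jobName) {
    this.jobName = jobName;
  }
}

// Hypothetical analogue of NutchJob.
class SketchNutchJob extends SketchJob {
  static final String DEFAULT_NAME = "NutchJob";

  // Mirrors the constructor in the diff: every job starts out as "NutchJob".
  SketchNutchJob() {
    super(DEFAULT_NAME);
  }

  // Assumed alternative (not from the PR): the caller names the job up front.
  SketchNutchJob(String jobName) {
    super(jobName);
  }
}

class JobNameDemo {
  public static void main(String[] args) {
    SketchJob stats = new SketchNutchJob();
    System.out.println(stats.getJobName()); // default name

    stats.setJobName("stats"); // override where the job is created
    System.out.println(stats.getJobName());

    SketchJob dedup = new SketchNutchJob("dedup");
    System.out.println(dedup.getJobName());
  }
}
```

With only the default constructor, any call site that forgets `setJobName` produces a job called "NutchJob", which is the concern raised in the review question. Taking the name as a constructor argument would make that omission harder.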

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r116693310

        ##########
        File path: src/java/org/apache/nutch/crawl/CrawlDb.java
        ##########
        @@ -28,8 +28,9 @@
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.fs.*;
        import org.apache.hadoop.conf.*;
        -import org.apache.hadoop.mapred.*;
        import org.apache.hadoop.mapreduce.Job;
        +import org.apache.hadoop.mapreduce.lib.input.*;

        Review comment:
        I am changing it in the local copy and I will update the PR today.

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-302783499

        Please ignore the number of commits; I will squash them when this PR is merged. Thanks.

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-302793054

        np @Omkar20895 keep them coming

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-302793286

        Also @Omkar20895 please ensure that this PR is kept up-to-date with the master branch, or else we will end up in a mess. Thank you

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-302852374

        yes I will @lewismc

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-304350642

        @lewismc please review the latest commit. The errors I was mentioning will be found in the files crawl/Generator.java, crawl/CrawlDbReader.java, and crawl/DeduplicationJob.java

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r118763652

        ##########
        File path: src/java/org/apache/nutch/crawl/CrawlDb.java
        ##########
        @@ -28,8 +28,9 @@
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.fs.*;
        import org.apache.hadoop.conf.*;
        -import org.apache.hadoop.mapred.*;
        import org.apache.hadoop.mapreduce.Job;
        +import org.apache.hadoop.mapreduce.lib.input.*;

        Review comment:
        Can you please update this as well?

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-304371425

        @Omkar20895 I have no idea what the errors are as I am getting a whole host of compile issues due to the changing nature of NutchJob with the upgrade.
        Please go ahead and upgrade all instances of
        ```
        JobConf job = new NutchJob(getConf());
        ```
        We can then take it job-by-job, thanks. Also please make the updates as per my previous review and comments. Let's keep this branch up-to-date with master and also up-to-date with my comments. Thank you very much @Omkar20895

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-304371641

        @Omkar20895 you will also see the following
        ```
        /usr/local/nutch/src/java/org/apache/nutch/util/NutchJob.java:29: warning: [deprecation] Job(Configuration,String) in Job has been deprecated
        [javac] super(conf, "NutchJob");
        ```
        This also needs to be fixed before you go on and do anything else.
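
A minimal sketch of the usual shape of this fix, assuming the goal is to stop calling a deprecated public constructor and go through a static factory instead (in Hadoop the documented replacement for the deprecated `Job(Configuration, String)` constructor is the static `Job.getInstance(Configuration, String)`). The classes `SketchJob` and `SketchNutchJob` below are hypothetical stand-ins written against the JDK only; they are not Hadoop or Nutch classes.

```java
import java.util.Objects;

/** Hypothetical stand-in for a framework job class that is created via a factory. */
final class SketchJob {
  private final String jobName;

  // The constructor is private, so callers cannot depend on it directly.
  private SketchJob(String jobName) {
    this.jobName = Objects.requireNonNull(jobName, "jobName");
  }

  /** Static factory: the only supported way to obtain an instance. */
  static SketchJob getInstance(String jobName) {
    return new SketchJob(jobName);
  }

  String getJobName() {
    return jobName;
  }
}

/** Hypothetical project-specific helper that fixes the job name in one place. */
final class SketchNutchJob {
  private SketchNutchJob() {
    // Utility holder; not meant to be instantiated.
  }

  static SketchJob getJobInstance() {
    return SketchJob.getInstance("NutchJob");
  }
}
```

With this shape, call sites obtain the job through the helper rather than through `new`, so a later change to how jobs are created only touches one method.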

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r121465277

        ##########
        File path: src/java/org/apache/nutch/crawl/CrawlDb.java
        ##########
        @@ -28,8 +28,12 @@
        import org.apache.hadoop.io.*;

        Review comment:
        @Omkar20895 can you please sort out all instances of wildcard imports for Hadoop?

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r121465424

        ##########
        File path: src/java/org/apache/nutch/crawl/CrawlDbMerger.java
        ##########
        @@ -32,7 +32,9 @@
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.Writable;
        -import org.apache.hadoop.mapred.*;
        +import org.apache.hadoop.mapreduce.*;

        Review comment:
        @Omkar20895 please address the above issue

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-308030299

        Hi @lewismc, the major error is produced by SequenceFileOutputFormat.getReaders(), which has been removed from the latest mapreduce API. I cannot find a replacement for it. Thanks.
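
One possible direction, sketched here under assumptions: if no drop-in replacement exists, a small project-local helper can list the part files of an output directory in sorted order and open one reader per file. The sketch below shows only the listing step, using `java.nio.file` so that it runs without Hadoop; the class name `PartFileLister` is hypothetical, and skipping names that start with `_` or `.` is an assumption about which files are bookkeeping rather than data. In Hadoop itself the analogous calls would be `FileSystem.listStatus` for the listing and `SequenceFile.Reader` for each part.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: lists the data part files of a job output directory. */
final class PartFileLister {
  private PartFileLister() {
    // Utility holder; not meant to be instantiated.
  }

  /**
   * Returns the regular files in dir, skipping names that start with '_' or '.'
   * (assumed to be markers or checksums), sorted so that index i lines up with
   * partition i.
   */
  static List<Path> listPartFiles(Path dir) throws IOException {
    List<Path> parts = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      for (Path p : stream) {
        String name = p.getFileName().toString();
        if (Files.isRegularFile(p) && !name.startsWith("_") && !name.startsWith(".")) {
          parts.add(p);
        }
      }
    }
    // Sorting keeps the order stable across runs and file systems.
    Collections.sort(parts);
    return parts;
  }
}
```

The sorted order matters when a caller picks a part by partition number, since directory listings are not guaranteed to come back in name order.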

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-308946744

        Hi @lewismc, did you get a chance to look at this issue? Thanks.

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-308948137

        Yes I did @Omkar20895

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-308948181

        I'll comment on the PR

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-309027522

        Hi @Omkar20895 I am getting the following build errors
        ```
        compile-core:
        [javac] Compiling 284 source files to /usr/local/nutch/build/classes
        [javac] /usr/local/nutch/src/java/org/apache/nutch/crawl/CrawlDb.java:119: error: unreported exception InterruptedException; must be caught or declared to be thrown
        [javac] int complete = job.waitForCompletion(true)?0:1;
        [javac] ^
        [javac] /usr/local/nutch/src/java/org/apache/nutch/crawl/CrawlDbReader.java:399: error: cannot find symbol
        [javac] SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(config,
        [javac] ^
        [javac] symbol: method getReaders(Configuration,Path)
        [javac] location: class SequenceFileOutputFormat
        [javac] /usr/local/nutch/src/java/org/apache/nutch/hostdb/ReadHostDb.java:208: error: cannot find symbol
        [javac] SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(conf, hostDb);
        [javac] ^
        [javac] symbol: method getReaders(Configuration,Path)
        [javac] location: class SequenceFileOutputFormat
        [javac] /usr/local/nutch/src/java/org/apache/nutch/segment/SegmentReader.java:434: error: cannot find symbol
        [javac] SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(
        [javac] ^
        [javac] symbol: method getReaders(Configuration,Path)
        [javac] location: class SequenceFileOutputFormat
        [javac] /usr/local/nutch/src/java/org/apache/nutch/segment/SegmentReader.java:511: error: cannot find symbol
        [javac] SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(
        [javac] ^
        [javac] symbol: method getReaders(Configuration,Path)
        [javac] location: class SequenceFileOutputFormat
        [javac] /usr/local/nutch/src/java/org/apache/nutch/tools/FreeGenerator.java:197: error: cannot find symbol
        [javac] job.setOutputKeyComparatorClass(Generator.HashComparator.class);
        [javac] ^
        [javac] symbol: method setOutputKeyComparatorClass(Class<HashComparator>)
        [javac] location: variable job of type Job
        [javac] /usr/local/nutch/src/java/org/apache/nutch/util/NutchJob.java:29: warning: [deprecation] Job(Configuration,String) in Job has been deprecated
        [javac] super(conf, "NutchJob");
        [javac] ^
        [javac] Note: Some input files use unchecked or unsafe operations.
        [javac] Note: Recompile with -Xlint:unchecked for details.
        [javac] 6 errors
        [javac] 1 warning
        ```
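
The first error in the log above is a plain Java checked-exception issue: the blocking `waitForCompletion(boolean)` call on the new `Job` class declares `InterruptedException` and `ClassNotFoundException` in addition to `IOException`, so the caller has to catch or declare them. A minimal sketch of one way to handle this, assuming the surrounding method only declares `IOException`; `BlockingJob` and `JobRunner` are hypothetical stand-ins so that the example runs without Hadoop.

```java
import java.io.IOException;

/** Hypothetical stand-in with the same checked exceptions as the blocking job call. */
interface BlockingJob {
  boolean waitForCompletion(boolean verbose)
      throws IOException, InterruptedException, ClassNotFoundException;
}

/** Hypothetical helper that maps a finished job to an exit code. */
final class JobRunner {
  private JobRunner() {
    // Utility holder; not meant to be instantiated.
  }

  /** Returns 0 if the job succeeded and 1 if it failed. */
  static int run(BlockingJob job) throws IOException {
    try {
      return job.waitForCompletion(true) ? 0 : 1;
    } catch (InterruptedException e) {
      // Restore the interrupt flag so callers further up can still see it.
      Thread.currentThread().interrupt();
      throw new IOException("Job was interrupted", e);
    } catch (ClassNotFoundException e) {
      throw new IOException("Job classes could not be loaded", e);
    }
  }
}
```

The `setOutputKeyComparatorClass` error looks like the same kind of API difference: that setter belongs to the old `JobConf`, and the new `Job` class offers `setSortComparatorClass` instead.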

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-309027722

        Please start with the following issue
        ```
        [javac] symbol: method setOutputKeyComparatorClass(Class<HashComparator>)
        [javac] location: variable job of type Job
        [javac] /usr/local/nutch/src/java/org/apache/nutch/util/NutchJob.java:29: warning: [deprecation] Job(Configuration,String) in Job has been deprecated
        [javac] super(conf, "NutchJob");
        [javac] ^
        ```

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-309028312

        @Omkar20895 please refer to the following class for guidance
        https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/util/NutchJob.java
        Once you push a commit please ping me. Thank you.

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-309217381

        @lewismc please take a look at this commit.

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r122575484

        ##########
        File path: src/java/org/apache/nutch/util/NutchJob.java
        ##########
        @@ -25,8 +25,9 @@
         /** A {@link Job} for Nutch jobs. */
         public class NutchJob extends Job {

        -  public NutchJob(Configuration conf) throws IOException {
        -    super(conf, "NutchJob");
        -  }
        +  public static NutchJob getJobInstance(Configuration conf){

        Review comment:
        This latest addition will still give an error, because the class no longer declares a constructor and the compiler will add a default constructor with an implicit super() call to the parent class. Something of this form:

        Default constructor:
        public NutchJob() {
          super();
        }

        But this won't cause any discrepancy, because that default constructor is not used anywhere in the code base.
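For reference, a minimal sketch of the default-constructor behaviour described above. `BaseJob` and `SubJob` are hypothetical stand-ins, not the Hadoop or Nutch classes; the point is only that a subclass declaring no constructor still ends up with one that calls `super()`.

```java
import java.lang.reflect.Constructor;

public class DefaultConstructorDemo {

  /** Hypothetical stand-in for a base class such as Job; not the Hadoop class. */
  public static class BaseJob {
    private final String name;

    /** No-argument constructor, marked deprecated as in the scenario above. */
    @Deprecated
    public BaseJob() {
      this.name = "unnamed";
    }

    public String getName() {
      return name;
    }
  }

  /**
   * Declares no constructor of its own, so the compiler generates
   * "public SubJob() { super(); }" and BaseJob() is still reached.
   */
  public static class SubJob extends BaseJob {
    public static SubJob getJobInstance() {
      return new SubJob();
    }
  }

  /** Counts the constructors that ended up in SubJob's class file. */
  public static int constructorCount() {
    return SubJob.class.getDeclaredConstructors().length;
  }

  public static void main(String[] args) {
    Constructor<?> ctor = SubJob.class.getDeclaredConstructors()[0];
    System.out.println("constructors: " + constructorCount());
    System.out.println("parameters: " + ctor.getParameterCount());
    System.out.println("name: " + SubJob.getJobInstance().getName());
  }
}
```

In this single-file sketch javac does not print a deprecation warning, because the deprecated constructor and its use share one outermost class. With the two classes in separate files, the implicit `super()` call would be expected to produce a warning rather than an error.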

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r122575534

        ##########
        File path: src/java/org/apache/nutch/util/NutchJob.java
        ##########
        @@ -25,8 +25,9 @@
         /** A {@link Job} for Nutch jobs. */
         public class NutchJob extends Job {

        -  public NutchJob(Configuration conf) throws IOException {
        -    super(conf, "NutchJob");
        -  }
        +  public static NutchJob getJobInstance(Configuration conf){

        Review comment:
        This doesn't give an error but gives a warning.

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-309652343

        Hi, @lewismc any update on this?

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-309812412

        Hi @Omkar20895 please update all instances of ```SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(getConf(), dir);``` to ```MapFile.Reader[] readers = MapFileOutputFormat.getReaders(dir, getConf());```.
        Please push an update once this is done and we can take it from there. Thank you.

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-309813213

        But MapFileOutputFormat and SequenceFileOutputFormat are different, aren't they? Won't it cause some discrepancy?
        @lewismc

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-309815700

        > But MapFileOutputFormat and SequenceFileOutputFormat are different, aren't they?

        Yes, they have different semantics. You can see the difference between [MapFile](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/MapFile.html) and [SequenceFile](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/SequenceFile.html).

        > Won't it cause some discrepancy?

        It may well do. For now, though, you can make the change (there are only 4 or 5 instances), and we have this thread for reference. If there are discrepancies, we can come back and fix them.
        The most important thing is that you keep moving forward.
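As a rough analogy only, the difference in semantics can be pictured with plain Java collections. A SequenceFile is a flat file of key/value records, while a MapFile keeps its records sorted by key and adds an index for lookups. The class and method names below are hypothetical, and no Hadoop API is used.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FileSemanticsDemo {

  /** Records in the order they were written, as a flat key/value file would hold them. */
  public static List<Map.Entry<String, String>> sampleRecords() {
    List<Map.Entry<String, String>> records = new ArrayList<>();
    records.add(new SimpleEntry<>("http://b.example/", "fetched"));
    records.add(new SimpleEntry<>("http://a.example/", "unfetched"));
    return records;
  }

  /** Flat-file style lookup: scan every record until the key matches. */
  public static String scanLookup(List<Map.Entry<String, String>> records, String key) {
    for (Map.Entry<String, String> record : records) {
      if (record.getKey().equals(key)) {
        return record.getValue();
      }
    }
    return null;
  }

  /** Sorted-and-indexed style: keys are ordered, so a lookup needs no full scan. */
  public static TreeMap<String, String> toSortedIndex(List<Map.Entry<String, String>> records) {
    TreeMap<String, String> index = new TreeMap<>();
    for (Map.Entry<String, String> record : records) {
      index.put(record.getKey(), record.getValue());
    }
    return index;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, String>> records = sampleRecords();
    TreeMap<String, String> index = toSortedIndex(records);
    System.out.println("first written key: " + records.get(0).getKey());
    System.out.println("first sorted key:  " + index.firstKey());
    System.out.println("scan lookup:       " + scanLookup(records, "http://a.example/"));
    System.out.println("index lookup:      " + index.get("http://a.example/"));
  }
}
```

Both lookups return the same value. What differs is the ordering guarantee and how the record is found, which is where discrepancies could appear once the reader type changes.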

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310011122

        Done! What about the other two errors?
        For one of them, I cannot find the replacement for:
        /home/omkar/Documents/GIT/nutch1/src/java/org/apache/nutch/tools/FreeGenerator.java:197: error: cannot find symbol
        [javac] job.setOutputKeyComparatorClass(Generator.HashComparator.class);

        @lewismc

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310097433

        @Omkar20895 before I look at this, which Class or Method is no longer found? ```setOutputKeyComparatorClass```, ```Generator```, or ```HashComparator```?

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310113233

        @lewismc setOutputKeyComparatorClass.

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310138744

        @Omkar20895 did you push your new commits? I don't see them yet and I just tried pulling.

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310139507

        Please replace ```job.setOutputKeyComparatorClass``` with ```job.setSortComparatorClass```

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310141065

        Also, the NutchJob class should look as follows
        ```
        package org.apache.nutch.util;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapred.JobConf;

        /** A {@link JobConf} for Nutch jobs. */
        public class NutchJob extends JobConf {

          public NutchJob(Configuration conf) {
            super(conf, NutchJob.class);
          }

        }
        ```

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310144221

        `public class NutchJob extends JobConf`
        Okay, let me explain why I changed it. JobConf is not part of the new MapReduce API, so I replaced it with the Job class, and NutchJob now extends Job.
        ```
        public NutchJob(Configuration conf) {
          super(conf, NutchJob.class);
        }
        ```
        If you look at this constructor, it has a super call. With NutchJob extending Job, that call has to go through a Job constructor such as Job(Configuration, String), which is deprecated, and the documentation suggests using getInstance instead.

        Please refer to constructor summary section [here](https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapreduce/Job.html).

        Please feel free to correct me if I am wrong. Thanks.
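To illustrate the factory-method idea in isolation, here is a minimal sketch with hypothetical stand-in classes (`Conf`, `BaseJob`, `ProjectJob`); none of them are Hadoop or Nutch types. As a simplification, the stand-in constructor is private rather than deprecated, so the factory is the only way in. This is one possible shape and not necessarily what the PR ends up doing.

```java
public class FactoryMethodDemo {

  /** Hypothetical stand-in for a configuration object; not Hadoop's Configuration. */
  public static class Conf {
    private final String appName;

    public Conf(String appName) {
      this.appName = appName;
    }

    public String getAppName() {
      return appName;
    }
  }

  /** Hypothetical stand-in for a job class that is created through a factory method. */
  public static class BaseJob {
    private final Conf conf;
    private final String jobName;

    // Private constructor: callers have to go through getInstance.
    private BaseJob(Conf conf, String jobName) {
      this.conf = conf;
      this.jobName = jobName;
    }

    public static BaseJob getInstance(Conf conf, String jobName) {
      return new BaseJob(conf, jobName);
    }

    public String getJobName() {
      return jobName;
    }

    public Conf getConf() {
      return conf;
    }
  }

  /** Project-level helper that delegates to the factory instead of calling super(...). */
  public static final class ProjectJob {
    private ProjectJob() {
      // Utility holder; never instantiated.
    }

    public static BaseJob getInstance(Conf conf) {
      return BaseJob.getInstance(conf, "ProjectJob");
    }
  }

  public static void main(String[] args) {
    BaseJob job = ProjectJob.getInstance(new Conf("demo"));
    System.out.println(job.getJobName() + " for " + job.getConf().getAppName());
  }
}
```

Delegating to a static factory avoids the `super(...)` call entirely, so no constructor of the base class is referenced from project code.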

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310156248

        OK, as long as the deprecation WARN is removed at some stage, I am happy with that. Looking forward to this upgrade being finished so we can test it out. Thank you @Omkar20895

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310279028

        Hi, there is one last error that I am concerned about.

        ```
        [javac] /home/omkar/Documents/GIT/nutch1/src/java/org/apache/nutch/crawl/CrawlDb.java:119: error: unreported exception InterruptedException; must be caught or declared to be thrown
        [javac] int complete = job.waitForCompletion(true)?0:1;
        ```
        I have used `int complete = job.waitForCompletion(true)?0:1;` in multiple files without any error, so I am unable to understand why I am getting it here. Thanks.

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310300795

        @Omkar20895 ... again, you've not pushed anything to the remote branch. Please push so I can pull locally. Thanks.

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310357695

        @lewismc pardon me for the delay. I thought I pushed a commit. 😅

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310437374

        Check out [Job.waitForCompletion](https://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapreduce/Job.html). It declares the following:
        ```
        Throws:
        IOException - thrown if the communication with the JobTracker is lost
        InterruptedException
        ClassNotFoundException
        ```
        These need to be handled appropriately. Please have a look at exception handling within method calls, e.g. try ... catch.
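
The compiler error follows from Java's catch-or-declare rule for checked exceptions: the call compiles only where the enclosing method either catches all three exceptions or already declares a broad enough `throws` clause. Below is a minimal sketch of both options. `FakeJob` is a hypothetical stand-in that copies only the throws clause of `Job.waitForCompletion`; it is not Hadoop code. Wrapping into `IOException` in the second method is just one illustrative choice.

```java
import java.io.IOException;

public class CheckedExceptionSketch {

  /** Hypothetical stand-in for Job; only the throws clause mirrors Job.waitForCompletion. */
  static class FakeJob {
    private final boolean succeed;

    FakeJob(boolean succeed) {
      this.succeed = succeed;
    }

    boolean waitForCompletion(boolean verbose)
        throws IOException, InterruptedException, ClassNotFoundException {
      return succeed;
    }
  }

  /** Option 1: declare the checked exceptions and let the caller decide what to do. */
  static int runDeclaring(FakeJob job)
      throws IOException, InterruptedException, ClassNotFoundException {
    return job.waitForCompletion(true) ? 0 : 1;
  }

  /** Option 2: the method declares only IOException, so the other two must be caught here. */
  static int runCatching(FakeJob job) throws IOException {
    try {
      return job.waitForCompletion(true) ? 0 : 1;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // restore the interrupt flag
      throw new IOException("Job interrupted", e);
    } catch (ClassNotFoundException e) {
      throw new IOException("Job class not found", e);
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(runDeclaring(new FakeJob(true)));  // 0
    System.out.println(runCatching(new FakeJob(false)));  // 1
  }
}
```

Removing the two `catch` blocks from `runCatching` reproduces the "unreported exception InterruptedException" error quoted earlier in the thread.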

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310624270

        @lewismc

        >>>`I have used int complete = job.waitForCompletion(true)?0:1; in multiple files and I did not get any error in those files and I am unable to understand why I am getting this here`

        This exception handling needs to be done in every file that uses the above line. I am working on it and will push a commit as soon as I am done. I am making the following change.

        Converting `int complete = job.waitForCompletion(true)?0:1;` to

        ```
        try {
          int complete = job.waitForCompletion(true)?0:1;
        } catch (InterruptedException e) {
          LOG.info("Exception: "+e);
          throw e;
        } catch (ClassNotFoundException e) {
          LOG.info("Exception: "+e);
          throw e;
        }
        ```
        Thanks.

        githubbot ASF GitHub Bot added a comment -

        sebastian-nagel commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310639864

        Hi @Omkar20895, blindly catching exceptions (logging and throwing them again) does not make the code better. It's important to think about whether the exception needs to be handled or not. If job.waitForCompletion(true) throws an exception, the job has failed:

        • sometimes you need to clean up so that long-lived data structures (CrawlDb, LinkDb) are not left broken, e.g. in [Injector](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Injector.java#L402)
        • for write-once structures (segments) this is usually not a requirement, as broken segments are just ignored by other tools. In this case it's enough to throw the exception (it needs to be declared to be thrown)

        Afaics, you're on the right track. Just make sure before a push that everything compiles (`ant clean runtime javadoc test`).
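
The two cases above can be sketched as follows. This is a hedged illustration rather than Nutch code: `JobLike` is a hypothetical interface standing in for the Hadoop job, and a temporary file stands in for the job's output path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CleanupOnFailureSketch {

  /** Hypothetical stand-in for a job whose run may throw the three checked exceptions. */
  interface JobLike {
    boolean waitForCompletion(boolean verbose)
        throws IOException, InterruptedException, ClassNotFoundException;
  }

  /** Long-lived structure: remove the partial output before propagating the failure. */
  static void updateWithCleanup(JobLike job, Path tempOutput)
      throws IOException, InterruptedException, ClassNotFoundException {
    try {
      job.waitForCompletion(true);
    } catch (IOException | InterruptedException | ClassNotFoundException e) {
      Files.deleteIfExists(tempOutput); // stand-in for removing the job's output path
      throw e; // rethrown with its original type
    }
  }

  /** Write-once structure: nothing to clean up, so declaring the exceptions is enough. */
  static void writeOnce(JobLike job)
      throws IOException, InterruptedException, ClassNotFoundException {
    job.waitForCompletion(true);
  }

  public static void main(String[] args) throws Exception {
    Path tempOutput = Files.createTempFile("crawldb-sketch", ".tmp");
    try {
      updateWithCleanup(verbose -> { throw new IOException("simulated job failure"); }, tempOutput);
    } catch (IOException e) {
      System.out.println("failed: " + e.getMessage()
          + ", output exists: " + Files.exists(tempOutput));
    }
  }
}
```

The cleanup branch is only worth writing where a failed job could leave something behind that later tools would read; otherwise the plain `throws` clause in `writeOnce` keeps the code shorter.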

        githubbot ASF GitHub Bot added a comment -

        sebastian-nagel commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r123725912

        ##########
        File path: src/java/org/apache/nutch/crawl/CrawlDb.java
        ##########
        @@ -111,7 +116,7 @@ public void update(Path crawlDb, Path[] segments, boolean normalize,
        LOG.info("CrawlDb update: Merging segment data into db.");
        }
        try {
        - JobClient.runJob(job);
        + int complete = job.waitForCompletion(true)?0:1;
        } catch (IOException e) {

        Review comment:
        Need to catch all thrown exceptions, incl. ClassNotFoundException and InterruptedException. Could also just catch `Exception`.

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r123726840

        ##########
        File path: src/java/org/apache/nutch/crawl/CrawlDb.java
        ##########
        @@ -111,7 +116,7 @@ public void update(Path crawlDb, Path[] segments, boolean normalize,
        LOG.info("CrawlDb update: Merging segment data into db.");
        }
        try {
        - JobClient.runJob(job);
        + int complete = job.waitForCompletion(true)?0:1;
        } catch (IOException e) {

        Review comment:
        @sebastian-nagel in the above job, when there is an IOException the output path is removed (cleaned up) before the error is rethrown. The same needs to be done when we catch the other two exceptions, i.e. ClassNotFoundException and InterruptedException, doesn't it? Please feel free to correct me if I am wrong.

        githubbot ASF GitHub Bot added a comment -

        sebastian-nagel commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r123728900

        ##########
        File path: src/java/org/apache/nutch/crawl/CrawlDb.java
        ##########
        @@ -111,7 +116,7 @@ public void update(Path crawlDb, Path[] segments, boolean normalize,
        LOG.info("CrawlDb update: Merging segment data into db.");
        }
        try {
        - JobClient.runJob(job);
        + int complete = job.waitForCompletion(true)?0:1;
        } catch (IOException e) {

        Review comment:
        Yes, exactly, e.g.
        ```
        public void update(...) throws Exception {
          ...
          try {
            job.waitForCompletion(true);
          } catch (IOException | ClassNotFoundException | InterruptedException e) {
            // do cleanup
            throw e;
          }
        }
        ```

        githubbot ASF GitHub Bot added a comment -

        sebastian-nagel commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r123728981

        ##########
        File path: src/java/org/apache/nutch/crawl/CrawlDb.java
        ##########
        @@ -111,7 +116,7 @@ public void update(Path crawlDb, Path[] segments, boolean normalize,
        LOG.info("CrawlDb update: Merging segment data into db.");
        }
        try {
        - JobClient.runJob(job);
        + int complete = job.waitForCompletion(true)?0:1;
        } catch (IOException e) {

        Review comment:
        I left out `complete` because it is not used.

        githubbot ASF GitHub Bot added a comment -

        sebastian-nagel commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r123730458

        ##########
        File path: src/java/org/apache/nutch/crawl/CrawlDb.java
        ##########
        @@ -111,7 +116,7 @@ public void update(Path crawlDb, Path[] segments, boolean normalize,
        LOG.info("CrawlDb update: Merging segment data into db.");
        }
        try {
        - JobClient.runJob(job);
        + int complete = job.waitForCompletion(true)?0:1;
        } catch (IOException e) {

        Review comment:
        Of course, you could also declare the specific exceptions instead (maybe a matter of taste):
        ```
        public void update(...) throws IOException, ClassNotFoundException, InterruptedException {
        ```

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310693000

        The next commit will be updating the plugins and tests which I am working on. I will update the PR as soon as I am done. Thanks.

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-310733679

        Thank you @sebastian-nagel excellent advice

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-311569394

        How are things coming along @Omkar20895 ?

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-312704405

        Hi,
        These are the warnings and errors that I am getting right now and am unable to resolve.

        ```
        compile-core-test:
        [javac] Compiling 54 source files to /Users/omkar/Documents/Git/nutch/build/test/classes
        [javac] /Users/omkar/Documents/Git/nutch/src/test/org/apache/nutch/crawl/CrawlDbUpdateTestDriver.java:95: warning: [deprecation] setConfiguration(Configuration) in TestDriver has been deprecated
        [javac] reduceDriver.setConfiguration(configuration);
        [javac] ^
        [javac] /Users/omkar/Documents/Git/nutch/src/test/org/apache/nutch/crawl/CrawlDbUpdateUtil.java:58: error: cannot find symbol
        [javac] reducer.configure(Job.getInstance(conf));
        [javac] ^
        [javac] symbol: method configure(Job)
        [javac] location: variable reducer of type T
        [javac] where T is a type-variable:
        [javac] T extends Reducer<Text,CrawlDatum,Text,CrawlDatum> declared in class CrawlDbUpdateUtil
        ```
        The reducer referred to here is [CrawlDbReducer](https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/crawl/CrawlDbReducer.java), and even though it has a configure method, I still get this error.

        ```
        [javac] /Users/omkar/Documents/Git/nutch/src/test/org/apache/nutch/crawl/CrawlDbUpdateUtil.java:62: error: CrawlDbUpdateUtil.DummyContext is not abstract and does not override abstract method write(Object,Object) in TaskInputOutputContext
        [javac] private class DummyContext extends Context {
        [javac] ^
        [javac]
        ```
        Here I have already overridden the write method, but I am still getting this error.

        ```
        /Users/omkar/Documents/Git/nutch/src/test/org/apache/nutch/crawl/CrawlDbUpdateUtil.java:124: error: reduce(KEYIN,Iterable<VALUEIN>,Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>.Context) has protected access in Reducer
        [javac] reducer.reduce(dummyURL, (Iterable)values, context);
        [javac] ^
        [javac] where KEYIN,VALUEIN,KEYOUT,VALUEOUT are type-variables:
        [javac] KEYIN extends Object declared in class Reducer
        [javac] VALUEIN extends Object declared in class Reducer
        [javac] KEYOUT extends Object declared in class Reducer
        [javac] VALUEOUT extends Object declared in class Reducer
        [javac] /Users/omkar/Documents/Git/nutch/src/test/org/apache/nutch/indexer/TestIndexerMapReduce.java:172: warning: [deprecation] setConfiguration(Configuration) in TestDriver has been deprecated
        [javac] reduceDriver.setConfiguration(configuration);
        [javac] ^
        [javac] Note: /Users/omkar/Documents/Git/nutch/src/test/org/apache/nutch/crawl/CrawlDbUpdateUtil.java uses unchecked or unsafe operations.
        [javac] Note: Recompile with -Xlint:unchecked for details.
        [javac] 3 errors
        [javac] 2 warnings

        ```
        To reproduce the errors, please pull the latest commit and run "ant clean runtime test". The runtime build succeeds, but the test target fails with the errors above. Thanks.
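
A note on the "has protected access" error: a common way for a test to reach a protected method is a test-only subclass that overrides it as public, since Java allows an override to widen visibility. The sketch below shows that pattern with stand-in classes only; `BaseReducer`, `SummingReducer` and `TestableSummingReducer` are hypothetical names, not Hadoop or Nutch types, and because everything lives in one file the protected call would compile here anyway. It illustrates the shape of the workaround, not the actual fix for the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProtectedReduceDemo {

  /** Stand-in for a framework base class whose reduce() is protected. */
  static class BaseReducer {
    protected void reduce(String key, Iterable<Integer> values, List<String> out) {
      for (Integer value : values) {
        out.add(key + "\t" + value); // pass every value through unchanged
      }
    }
  }

  /** Stand-in for the production reducer; visibility stays protected. */
  static class SummingReducer extends BaseReducer {
    @Override
    protected void reduce(String key, Iterable<Integer> values, List<String> out) {
      int sum = 0;
      for (int value : values) {
        sum += value;
      }
      out.add(key + "\t" + sum);
    }
  }

  /** Test-only subclass: an override may widen protected to public. */
  static class TestableSummingReducer extends SummingReducer {
    @Override
    public void reduce(String key, Iterable<Integer> values, List<String> out) {
      super.reduce(key, values, out);
    }
  }

  /** Calls the widened method the way a test harness would. */
  public static List<String> demo() {
    List<String> out = new ArrayList<>();
    new TestableSummingReducer().reduce("http://example.org/", Arrays.asList(1, 2, 3), out);
    return out;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

In the real code the harness is generic over `T extends Reducer<...>`, so the call resolves against `Reducer.reduce`, which is protected; driving the reducer through its public `run(Context)` method may be another option.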

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-312743093

        @Omkar20895

        regarding ```CrawlDbUpdateTestDriver.java:95: warning: [deprecation] setConfiguration(Configuration) in TestDriver has been deprecated```
        please see [javadoc](https://mrunit.apache.org/documentation/javadocs/1.1.0/index.html?org/apache/hadoop/mrunit/TestDriver.html): Deprecated. Use getConfiguration() to set configuration items as opposed to overriding the entire configuration object as it's used internally. Also make sure to update the Class and Method-level API documentation to ```{@link CrawlDbReducer#reduce(Text, Iterator, Context)}```; this is present in several Classes and you have not updated it yet.

        regarding ```CrawlDbUpdateUtil.java:58: error: cannot find symbol``` this is due to the removal of the ```configure``` method from Reducer.java. Please see [Javadoc](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Reducer.html). You may have to work with ```setup``` instead.
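
The "cannot find symbol" message can be reproduced without Hadoop. The sketch below uses stand-in classes only (`BaseReducer`, `CrawlDbLikeReducer` and `Harness` are hypothetical names, and a plain `Map` stands in for the configuration object). It shows that a call made through a type variable is resolved against the declared bound, so a `configure` method that exists only on the concrete subclass is not visible, while a lifecycle hook declared on the base class, analogous to `setup`, is reachable.

```java
import java.util.HashMap;
import java.util.Map;

public class SetupVsConfigure {

  /** Stand-in for a new-API style base class: it declares setup(), not configure(). */
  static class BaseReducer {
    protected void setup(Map<String, String> conf) {
      // default: nothing to initialise
    }

    /** Public entry point that drives the lifecycle. */
    public void run(Map<String, String> conf) {
      setup(conf);
    }
  }

  /** Stand-in for a concrete reducer that reads its settings in setup(). */
  static class CrawlDbLikeReducer extends BaseReducer {
    int retryMax = -1;

    @Override
    protected void setup(Map<String, String> conf) {
      // the key is only an illustrative setting for this sketch
      retryMax = Integer.parseInt(conf.getOrDefault("db.fetch.retry.max", "3"));
    }

    /** Leftover old-style method; it overrides nothing in BaseReducer. */
    public void configure(Map<String, String> conf) {
      setup(conf);
    }
  }

  /** Generic helper bounded by the base class, like the test utility in the error. */
  static class Harness<T extends BaseReducer> {
    final T reducer;

    Harness(T reducer, Map<String, String> conf) {
      this.reducer = reducer;
      // reducer.configure(conf); // would not compile: BaseReducer has no configure()
      reducer.run(conf);          // compiles: run() is declared on the bound
    }
  }

  /** Builds a small configuration and returns the value picked up in setup(). */
  public static int demo() {
    Map<String, String> conf = new HashMap<>();
    conf.put("db.fetch.retry.max", "5");
    return new Harness<>(new CrawlDbLikeReducer(), conf).reducer.retryMax;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Under these assumptions, moving the initialisation from `configure` into an overridden `setup` keeps the generic helper bounded by the base type and still runs the reducer's own set-up code.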

        I think fixing both of the above will get the code compiling again. Also check out the Javadoc for [TaskInputOutputContext](https://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapreduce/TaskInputOutputContext.html) and ensure that CrawlDbUpdateUtil.DummyContext overrides the appropriate methods with the appropriate parameters.
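
One possible cause of the DummyContext error, judging by the erased `write(Object,Object)` signature in the compiler message, is that the class extends the raw `Context` type. This is an assumption about the PR code, not something confirmed in the thread. The sketch below uses a stand-in generic class (`OutputContext` and `CollectingContext` are hypothetical names) to show why a typed `write` only counts as an override once the type arguments are supplied.

```java
import java.util.ArrayList;
import java.util.List;

public class ContextOverrideDemo {

  /** Stand-in for a generic context with one abstract output method. */
  abstract static class OutputContext<K, V> {
    abstract void write(K key, V value);
  }

  // Extending the raw type erases K and V to Object, so the abstract method
  // to implement becomes write(Object, Object). The declaration below would
  // be rejected as "not abstract and does not override abstract method":
  //
  //   static class RawContext extends OutputContext {
  //     void write(String key, Integer value) { }   // an overload, not an override
  //   }

  /** Supplying the type arguments makes write(String, Integer) a real override. */
  static class CollectingContext extends OutputContext<String, Integer> {
    final List<String> written = new ArrayList<>();

    @Override
    void write(String key, Integer value) {
      written.add(key + "=" + value);
    }
  }

  /** Writes one record through the base type and returns what was collected. */
  public static List<String> demo() {
    CollectingContext context = new CollectingContext();
    OutputContext<String, Integer> asBase = context; // call through the base type
    asBase.write("http://example.org/", 1);
    return context.written;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The real context class declares further abstract methods beyond `write`, so a dummy subclass would need to implement those as well; the compiler typically reports only the first one it finds missing.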

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-313889464

        Thank you for updating the PR @Omkar20895, please begin making your way through the tests and fixing the errors. AFAICT, the following tests currently fail. You can see all of the reports in ```build/tests```

        ```
        [junit] Running org.apache.nutch.crawl.TestCrawlDbFilter
        [junit] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 2.099 sec
        [junit] Test org.apache.nutch.crawl.TestCrawlDbFilter FAILED
        [junit] Running org.apache.nutch.crawl.TestCrawlDbMerger
        [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.986 sec
        [junit] Test org.apache.nutch.crawl.TestCrawlDbMerger FAILED
        [junit] Running org.apache.nutch.crawl.TestGenerator
        [junit] Tests run: 4, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 8.483 sec
        [junit] Test org.apache.nutch.crawl.TestGenerator FAILED
        [junit] Running org.apache.nutch.crawl.TestLinkDbMerger
        [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.448 sec
        [junit] Test org.apache.nutch.crawl.TestLinkDbMerger FAILED
        [junit] Running org.apache.nutch.fetcher.TestFetcher
        [junit] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.323 sec
        [junit] Test org.apache.nutch.fetcher.TestFetcher FAILED
        [junit] Running org.apache.nutch.indexer.TestIndexerMapReduce
        [junit] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.766 sec
        [junit] Test org.apache.nutch.indexer.TestIndexerMapReduce FAILED
        [junit] Running org.apache.nutch.plugin.TestPluginSystem
        [junit] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.073 sec
        [junit] Test org.apache.nutch.plugin.TestPluginSystem FAILED
        [junit] Running org.apache.nutch.segment.TestSegmentMerger
        [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 23.399 sec
        [junit] Test org.apache.nutch.segment.TestSegmentMerger FAILED
        [junit] Running org.apache.nutch.segment.TestSegmentMergerCrawlDatums
        [junit] Tests run: 7, Failures: 7, Errors: 0, Skipped: 0, Time elapsed: 32.267 sec
        [junit] Test org.apache.nutch.segment.TestSegmentMergerCrawlDatums FAILED
        ```

        @Omkar20895 have you tried to execute an end-to-end test crawl and validate the results? If not, then start there and see where it is all going wrong.
        Thanks

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r126293237

        ##########
        File path: src/java/org/apache/nutch/segment/SegmentMerger.java
        ##########
        @@ -384,40 +385,43 @@ public void setConf(Configuration conf) {
           public void close() throws IOException {
           }

        -  public void configure(JobConf conf) {
        +  public void configure(Job job) {
        +    Configuration conf = job.getConfiguration();
             setConf(conf);
             if (sliceSize > 0) {
        -      sliceSize = sliceSize / conf.getNumReduceTasks();
        +      sliceSize = sliceSize / Integer.parseInt(conf.get("mapreduce.map.tasks"));

        Review comment:
        This does not look correct at all... we should be attempting to obtain a slice size value by dividing an integer whose value is greater than zero by the configured number of Reduce tasks... not by the configured number of Map tasks.
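
As a rough illustration of the intended arithmetic, the sketch below divides the slice size by the reducer count read from `mapreduce.job.reduces`, the key Hadoop uses for the number of reduce tasks. The `Map` is a stand-in for Hadoop's `Configuration`, and the class name, method name, default of 1 and zero guard are assumptions made for this example rather than existing Nutch code.

```java
import java.util.HashMap;
import java.util.Map;

public class SliceSizeDemo {

  /** Hadoop's configuration key for the number of reduce tasks. */
  static final String NUM_REDUCES = "mapreduce.job.reduces";

  /**
   * Divides the total slice size by the configured number of reduce tasks.
   * The Map is a stand-in for Hadoop's Configuration.
   */
  public static long perReducerSliceSize(long sliceSize, Map<String, String> conf) {
    if (sliceSize <= 0) {
      return sliceSize; // slicing disabled, nothing to divide
    }
    // Fall back to 1 when the key is unset, and never divide by zero.
    int reducers = Math.max(1, Integer.parseInt(conf.getOrDefault(NUM_REDUCES, "1")));
    return sliceSize / reducers;
  }

  /** Example: 1000 records per slice spread over 4 reducers. */
  public static long demo() {
    Map<String, String> conf = new HashMap<>();
    conf.put(NUM_REDUCES, "4");
    return perReducerSliceSize(1000, conf);
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

With Hadoop on the classpath, `conf.getInt("mapreduce.job.reduces", 1)` or `getNumReduceTasks()` on the job context would play the role of the lookup here, and both avoid calling `Integer.parseInt` on a value that may be missing.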

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r126293242

        ##########
        File path: src/java/org/apache/nutch/segment/SegmentMerger.java
        ##########
        @@ -384,40 +385,43 @@ public void setConf(Configuration conf) {
           public void close() throws IOException {
           }

        -  public void configure(JobConf conf) {
        +  public void configure(Job job) {
        +    Configuration conf = job.getConfiguration();
             setConf(conf);
             if (sliceSize > 0) {
        -      sliceSize = sliceSize / conf.getNumReduceTasks();
        +      sliceSize = sliceSize / Integer.parseInt(conf.get("mapreduce.map.tasks"));
             }
           }

        -  private Text newKey = new Text();
        -  public void map(Text key, MetaWrapper value,
        -      OutputCollector<Text, MetaWrapper> output, Reporter reporter)
        -      throws IOException {
        -    String url = key.toString();
        -    if (normalizers != null) {
        -      try {
        -        url = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT); // normalize
        -        // the
        -        // url
        -      } catch (Exception e) {
        -        LOG.warn("Skipping " + url + ":" + e.getMessage());
        -        url = null;
        +  public static class SegmentMergerMapper extends
        +      Mapper<Text, MetaWrapper, Text, MetaWrapper> {
        +    public void map(Text key, MetaWrapper value,
        +        Context context) throws IOException, InterruptedException {
        +      Text newKey = new Text();
        +      String url = key.toString();
        +      if (normalizers != null) {
        +        try {
        +          url = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT); // normalize

        Review comment:
        Sort the comment out, make sure everything is on the same line.

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-313890314

        @jnioche @sebastian-nagel @chrismattmann @jorgelbg and rest of the team, would it be preferable for all Mapper and Reducer functions to be extracted from the individual Jobs which ```extends Configured implements Tool``` into ```mapper``` and ```reducer``` packages respectively, or is it fine to retain them as nested classes within the Job they belong to? Thanks for any input.

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-313890439

        @Omkar20895 start with ```TestSegmentMergerCrawlDatums.java```; the primary issue is as follows:
        ```
        2017-07-08 17:27:43,515 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(560)) - job_local1442546213_0007
        java.lang.Exception: java.lang.ClassCastException: org.apache.nutch.crawl.CrawlDatum cannot be cast to org.apache.nutch.metadata.MetaWrapper
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
        Caused by: java.lang.ClassCastException: org.apache.nutch.crawl.CrawlDatum cannot be cast to org.apache.nutch.metadata.MetaWrapper
        at org.apache.nutch.segment.SegmentMerger$SegmentMergerMapper.map(SegmentMerger.java:397)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
        ```
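
One possible reading of this trace, offered as an assumption rather than a confirmed diagnosis: the mapper declares `MetaWrapper` as its input value type, but the records reaching it at runtime are `CrawlDatum` instances. Java generics are erased, so such a mismatch compiles and only surfaces as a cast failure when `map` is invoked. The stdlib-only sketch below reproduces that failure mode; `CrawlDatumLike`, `MetaWrapperLike`, `ValueMapper`, `WrapperMapper` and `ErasureCastDemo` are hypothetical stand-ins, not Nutch or Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the two value types named in the trace.
class CrawlDatumLike {}

class MetaWrapperLike {
  final Object wrapped;
  MetaWrapperLike(Object wrapped) { this.wrapped = wrapped; }
}

// Hypothetical mapper-like interface with a generic input value type.
interface ValueMapper<V> {
  String map(V value);
}

// Declares MetaWrapperLike as its input type, like the mapper in the trace.
class WrapperMapper implements ValueMapper<MetaWrapperLike> {
  @Override
  public String map(MetaWrapperLike value) {
    return "wrapped:" + value.wrapped.getClass().getSimpleName();
  }
}

class ErasureCastDemo {

  // Mimics a framework that hands records to a mapper without knowing
  // its type parameters: the unchecked cast compiles, and the mismatch
  // is detected only when the mapper's method is actually called.
  @SuppressWarnings("unchecked")
  static String feed(ValueMapper<?> mapper, Object record) {
    return ((ValueMapper<Object>) mapper).map(record);
  }

  static List<String> run() {
    ValueMapper<MetaWrapperLike> mapper = new WrapperMapper();
    List<String> results = new ArrayList<>();

    // Record of the declared type: works.
    results.add(feed(mapper, new MetaWrapperLike(new CrawlDatumLike())));

    // Record of another type: the cast hidden by erasure fails at runtime.
    try {
      results.add(feed(mapper, new CrawlDatumLike()));
    } catch (ClassCastException e) {
      results.add("ClassCastException");
    }
    return results;
  }

  public static void main(String[] args) {
    run().forEach(System.out::println);
  }
}
```

If this reading applies, the place to look would be whatever produces the mapper's input values (for example the input format), rather than the body of `map` itself.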

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r126295341

        ##########
        File path: src/java/org/apache/nutch/segment/SegmentMerger.java
        ##########
        @@ -384,40 +385,43 @@ public void setConf(Configuration conf) {
           public void close() throws IOException {
           }

        -  public void configure(JobConf conf) {
        +  public void configure(Job job) {
        +    Configuration conf = job.getConfiguration();
             setConf(conf);
             if (sliceSize > 0) {
        -      sliceSize = sliceSize / conf.getNumReduceTasks();
        +      sliceSize = sliceSize / Integer.parseInt(conf.get("mapreduce.map.tasks"));

        Review comment:
        @lewismc pardon me, my bad! I will make this change and commit it in the next update.
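
For context on the line under review: the removed line divides by the number of reduce tasks, while the added line looks up a map-task key by name and parses it. `Integer.parseInt(null)` throws `NumberFormatException`, so an unset key would fail at configuration time. To my knowledge the new API exposes the reducer count through `getNumReduceTasks()` on the job or task context and stores it under `mapreduce.job.reduces`; both are assumptions to verify against the Hadoop version in use. The sketch below uses `java.util.Properties` as a stand-in for `Configuration`, and `SliceSizeSketch` and `perReducerSliceSize` are hypothetical names.

```java
import java.util.Properties;

class SliceSizeSketch {

  // Assumed property name for the reducer count; verify for your Hadoop version.
  static final String NUM_REDUCES_KEY = "mapreduce.job.reduces";

  // Splits the global slice size across reducers, falling back to a
  // single reducer when the key is unset or not positive.
  static long perReducerSliceSize(long sliceSize, Properties conf) {
    if (sliceSize <= 0) {
      return sliceSize; // slicing disabled: leave the value untouched
    }
    int reducers = Integer.parseInt(conf.getProperty(NUM_REDUCES_KEY, "1"));
    return sliceSize / Math.max(1, reducers);
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(NUM_REDUCES_KEY, "4");
    System.out.println(perReducerSliceSize(1000L, conf));
  }
}
```

Supplying a default and clamping to at least one reducer avoids both the null parse and a division by zero.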

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on a change in pull request #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#discussion_r126473415

        ##########
        File path: src/java/org/apache/nutch/segment/SegmentMerger.java
        ##########
        @@ -384,40 +385,43 @@ public void setConf(Configuration conf) {
           public void close() throws IOException {
           }

        -  public void configure(JobConf conf) {
        +  public void configure(Job job) {
        +    Configuration conf = job.getConfiguration();
             setConf(conf);
             if (sliceSize > 0) {
        -      sliceSize = sliceSize / conf.getNumReduceTasks();
        +      sliceSize = sliceSize / Integer.parseInt(conf.get("mapreduce.map.tasks"));
             }
           }

        -  private Text newKey = new Text();
        -  public void map(Text key, MetaWrapper value,
        -      OutputCollector<Text, MetaWrapper> output, Reporter reporter)
        -      throws IOException {
        -    String url = key.toString();
        -    if (normalizers != null) {
        -      try {
        -        url = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT); // normalize
        -        // the
        -        // url
        -      } catch (Exception e) {
        -        LOG.warn("Skipping " + url + ":" + e.getMessage());
        -        url = null;
        +  public static class SegmentMergerMapper extends
        +      Mapper<Text, MetaWrapper, Text, MetaWrapper> {
        +    public void map(Text key, MetaWrapper value,
        +        Context context) throws IOException, InterruptedException {
        +      Text newKey = new Text();
        +      String url = key.toString();
        +      if (normalizers != null) {
        +        try {
        +          url = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT); // normalize

        Review comment:
        Done!

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-314161600

        @lewismc working on it! Thanks.

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-315180650

        @lewismc I think we need to replace the configure() method with setup(). What would you suggest? Thanks for any suggestion.

        githubbot ASF GitHub Bot added a comment -

        lewismc commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-315181534

        @Omkar20895 yes
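
As a rough illustration of the direction agreed here: in the new API, `Mapper.run` calls `setup(context)` once, then `map` for each record, then `cleanup(context)`, so one-time initialisation that lived in `configure(JobConf)` moves into `setup`. The stdlib-only sketch below mimics that lifecycle; `MiniContext`, `MiniMapper`, `UrlPrefixMapper` and the `demo.prefix` key are hypothetical stand-ins, not Hadoop or Nutch names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the task context: configuration plus an output sink.
class MiniContext {
  final Map<String, String> conf = new LinkedHashMap<>();
  final List<String> output = new ArrayList<>();
}

// Hypothetical stand-in mirroring the setup / map / cleanup lifecycle.
abstract class MiniMapper {
  protected void setup(MiniContext context) {}

  protected abstract void map(String key, String value, MiniContext context);

  protected void cleanup(MiniContext context) {}

  // Setup once, map once per record, cleanup once.
  final void run(Map<String, String> records, MiniContext context) {
    setup(context);
    for (Map.Entry<String, String> e : records.entrySet()) {
      map(e.getKey(), e.getValue(), context);
    }
    cleanup(context);
  }
}

// Reads its settings in setup(), where the old API would have used configure().
class UrlPrefixMapper extends MiniMapper {
  private String prefix;

  @Override
  protected void setup(MiniContext context) {
    prefix = context.conf.getOrDefault("demo.prefix", "");
  }

  @Override
  protected void map(String key, String value, MiniContext context) {
    context.output.add(prefix + key + "=" + value);
  }

  public static void main(String[] args) {
    MiniContext context = new MiniContext();
    context.conf.put("demo.prefix", "seg:");
    Map<String, String> records = new LinkedHashMap<>();
    records.put("http://a/", "1");
    records.put("http://b/", "2");
    new UrlPrefixMapper().run(records, context);
    System.out.println(context.output);
  }
}
```

Because the settings are read from the context inside `setup`, the mapper no longer needs a `JobConf` handed to it before the first record arrives.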

        githubbot ASF GitHub Bot added a comment -

        Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
        URL: https://github.com/apache/nutch/pull/188#issuecomment-315181823

        @lewismc Thanks, I will start working on it.


          People

          • Assignee: Unassigned
          • Reporter: Omkar Reddy (omkar20895)
          • Votes: 0
          • Watchers: 2

            Dates

            • Created:
              Updated:

              Development