Hadoop Map/Reduce / MAPREDUCE-5886

Allow wordcount example job to accept multiple input paths.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 2.4.0
    • Fix Version/s: 2.5.0
    • Component/s: examples
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      It would be convenient if the wordcount example MapReduce job could accept multiple input paths and run the word count on all of them.

      1. MAPREDUCE-5886.1.patch
        2 kB
        Chris Nauroth
      2. MAPREDUCE-5886.2.patch
        2 kB
        Gera Shegalov
      3. MAPREDUCE-5886.3.patch
        6 kB
        Gera Shegalov

          Activity

          Chris Nauroth added a comment -

          I'm attaching a patch. I've tested this manually by running wordcount jobs that span multiple file systems ("hdfs", "webhdfs", and one custom file system scheme).
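
The argument handling this change implies can be sketched in plain Java. This is a minimal sketch, not the committed patch: the class WordCountArgs and its parse method are hypothetical names, and the convention that the last argument is the output path while all earlier arguments are inputs is an assumption based on the test commands quoted in this thread. In the real job, each input would presumably be handed to FileInputFormat.addInputPath(Job, Path).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Hadoop: splits wordcount-style arguments. */
final class WordCountArgs {
    final List<String> inputs;
    final String output;

    private WordCountArgs(List<String> inputs, String output) {
        this.inputs = inputs;
        this.output = output;
    }

    /**
     * Treats every argument except the last as an input path and the last
     * argument as the output path (assumed convention).
     */
    static WordCountArgs parse(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException("Usage: wordcount <in> [<in>...] <out>");
        }
        List<String> inputs = new ArrayList<>();
        for (int i = 0; i < args.length - 1; i++) {
            inputs.add(args[i]);
        }
        return new WordCountArgs(inputs, args[args.length - 1]);
    }

    public static void main(String[] args) {
        WordCountArgs parsed = parse(new String[] {"hdfs:///in1", "webhdfs:///in2", "wc-out"});
        System.out.println(parsed.inputs + " -> " + parsed.output);
    }
}
```

Because each input is a separate argument, no separator character has to be reserved inside a path.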

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12644185/MAPREDUCE-5886.1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-examples.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4594//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4594//console

          This message is automatically generated.

          Gera Shegalov added a comment -

          LGTM. Possibly we can add a method to FIF (FileInputFormat):

          /**
           * Adds numArgs paths, starting at offset, to the job's input.
           */
          public static void addInputPaths(Job job, String[] args, int offset, int numArgs);

          Then it can be used in other jobs, and FIF itself could reuse it to implement the existing

            public static void addInputPaths(Job job,
                                             String commaSeparatedPaths
                                             ) throws IOException;
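
The offset-based signature proposed above might behave roughly as follows. This is only a sketch under stated assumptions: RangeInputPaths is a hypothetical class, and a List<String> stands in for the Job's input configuration so that the example runs without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the proposed (offset, numArgs) API shape. */
final class RangeInputPaths {
    /**
     * Adds numArgs entries of args, starting at offset, to inputs.
     * The List stands in for the job's input configuration.
     */
    static void addInputPaths(List<String> inputs, String[] args, int offset, int numArgs) {
        if (offset < 0 || numArgs < 0 || offset > args.length - numArgs) {
            throw new IndexOutOfBoundsException(
                "offset=" + offset + ", numArgs=" + numArgs + ", length=" + args.length);
        }
        for (int i = offset; i < offset + numArgs; i++) {
            inputs.add(args[i]);
        }
    }

    public static void main(String[] args) {
        List<String> inputs = new ArrayList<>();
        // Everything except the trailing output argument is an input.
        addInputPaths(inputs, new String[] {"in1", "in2", "out"}, 0, 2);
        System.out.println(inputs);
    }
}
```

The range indices let a caller pass its argument array through untouched, at the cost of two extra parameters to validate.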
          
          Gera Shegalov added a comment -

          Alternatively, we can simply reuse FIF.addInputPaths in WordCount.

          Tested on a pseudo-distributed cluster with:

          wordcount file:///local/path,hdfspath wc-out1
          
          Akira AJISAKA added a comment -

          +1 (non-binding) for the v1 patch. The v2 patch cannot handle a path that includes a comma.
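
The comma problem can be illustrated with a simplified model. The sketch below uses String.split as a stand-in for a comma-separated path parser; it is not Hadoop's actual parsing code, and CommaSplitDemo and the sample paths are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model of comma-separated path parsing; not Hadoop's implementation. */
final class CommaSplitDemo {
    static List<String> splitOnCommas(String commaSeparatedPaths) {
        return Arrays.asList(commaSeparatedPaths.split(","));
    }

    public static void main(String[] args) {
        // Two intended paths; the second has a comma in its file name.
        System.out.println(splitOnCommas("/data/a.txt,/data/b,c.txt"));
        // prints [/data/a.txt, /data/b, c.txt]
    }
}
```

The two intended paths come back as three entries, because the comma inside the second file name is indistinguishable from the separator.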

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12644305/MAPREDUCE-5886.2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-examples.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4597//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4597//console

          This message is automatically generated.

          Gera Shegalov added a comment -

          "The v2 patch cannot handle a path that includes a comma."

          Akira, thanks for chiming in! That looks like a framework bug to me then. Are you suggesting that we deprecate FIF.addInputPaths(Job job, String commaSeparatedPaths)?

          Akira AJISAKA added a comment -

          Thanks, Gera Shegalov, for the comment. Filed MAPREDUCE-5889 to deprecate FIF.addInputPaths(Job job, String commaSeparatedPaths).

          Gera Shegalov added a comment -

          Chris Nauroth, apologies for spamming your JIRA. This is just to show the FIF API change I had in mind.

          Chris Nauroth added a comment -

          Hi, Gera Shegalov and Akira AJISAKA. Thanks for looking at this and contributing some new ideas.

          Regarding FileInputFormat#addInputPaths: in addition to the issue Akira raised about supporting a comma in a file name, there is another reason why I didn't use that method. On the Windows Command Prompt, the comma acts as an argument separator, much like a space. This could confuse users on Windows.

          The basic concept of the new API looks good to me. We might instead consider passing varargs and no range indices. Word count could chop the input args down to the correct range using Arrays#copyOfRange or List#subList.

          Would you mind moving all of the API work to another jira? MAPREDUCE-5889 probably would work for that. For this issue, I was hoping to put in a quick trivial patch in just word count to enable this. IOW, I'd like to pursue a binding +1 on patch v1 and commit it.

          Thanks again!
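
The varargs alternative mentioned in this comment could look something like the sketch below. It rests on assumptions: VarargsInputPaths and inputsFrom are hypothetical names, and a List<String> again stands in for the Job so that the example is self-contained. The caller trims the argument array with Arrays.copyOfRange before handing it over.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical varargs variant of an input-path helper. */
final class VarargsInputPaths {
    /** Adds every given path; the List stands in for the job's input configuration. */
    static void addInputPaths(List<String> inputs, String... paths) {
        inputs.addAll(Arrays.asList(paths));
    }

    /** Caller-side trimming: drops the trailing output argument and passes the rest on. */
    static List<String> inputsFrom(String[] args) {
        if (args.length < 1) {
            throw new IllegalArgumentException("expected at least an output argument");
        }
        List<String> inputs = new ArrayList<>();
        addInputPaths(inputs, Arrays.copyOfRange(args, 0, args.length - 1));
        return inputs;
    }

    public static void main(String[] args) {
        System.out.println(inputsFrom(new String[] {"in1", "in2", "out"}));
    }
}
```

This keeps the helper's signature free of range indices, but copyOfRange allocates a new array on each call.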

          Gera Shegalov added a comment -

          I considered both Arrays#copyOfRange and List#subList but discarded this due to creation of throwaway objects. Thanks for the discussion, Akira AJISAKA and Chris Nauroth. We can move the FIF changes to another JIRA.

          Chris Nauroth added a comment -

          "I considered both Arrays#copyOfRange and List#subList but discarded this due to creation of throwaway objects."

          It's not too bad for ArrayList#subList. It retains the original array and wraps it with different offset indices:

          http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l891

          You pay a flat cost for the extra indices and object overhead, but it's not a full array reallocation.
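
The difference between a view and a copy can be checked with a small experiment. The sketch below is illustrative only: SubListViewDemo is a hypothetical class, and it uses Arrays.asList (a fixed-size list backed by the array) rather than java.util.ArrayList, although both return views from subList.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrates that subList is a view while copyOfRange is an independent copy. */
final class SubListViewDemo {
    /** Returns true if writing through a subList view changes the original array. */
    static boolean subListSharesStorage() {
        String[] args = {"in1", "in2", "out"};
        List<String> view = Arrays.asList(args).subList(0, 2);
        view.set(0, "changed");
        return args[0].equals("changed");
    }

    /** Returns true if writing to a copyOfRange result changes the original array. */
    static boolean copySharesStorage() {
        String[] args = {"in1", "in2", "out"};
        String[] copy = Arrays.copyOfRange(args, 0, 2);
        copy[0] = "changed";
        return args[0].equals("changed");
    }

    public static void main(String[] args) {
        System.out.println("subList shares storage: " + subListSharesStorage());
        System.out.println("copyOfRange shares storage: " + copySharesStorage());
    }
}
```

A write through the subList view reaches the original array, whereas the copyOfRange result is a separate allocation.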

          Gera Shegalov added a comment -

          Chris, thanks for the JDK pointer, I am aware of the behavior.

          Siddharth Seth added a comment -

          +1. The original patch looks good to me. In a subsequent jira, does something need to be done about the way addInputPath(Job, Path) eventually propagates the additional paths (as a comma-separated property), so that commas in filenames are handled?

          Chris Nauroth added a comment -

          Thanks for the review, Sid.

          "In a subsequent jira, does something need to be done about the way addInputPath(Job, Path) eventually propagates the additional paths (as a comma-separated property), so that commas in filenames are handled?"

          Yes, Akira is handling this in MAPREDUCE-5889.

          Chris Nauroth added a comment -

          I committed this to trunk and branch-2. Thanks again, everyone.

          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #5673 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5673/)
          MAPREDUCE-5886. Allow wordcount example job to accept multiple input paths. Contributed by Chris Nauroth. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601704)

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/WordCount.java
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12644677/MAPREDUCE-5886.3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-examples.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4645//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4645//console

          This message is automatically generated.

          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #580 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/580/)
          MAPREDUCE-5886. Allow wordcount example job to accept multiple input paths. Contributed by Chris Nauroth. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601704)

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/WordCount.java
          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Hdfs-trunk #1771 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1771/)
          MAPREDUCE-5886. Allow wordcount example job to accept multiple input paths. Contributed by Chris Nauroth. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601704)

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/WordCount.java
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1798 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1798/)
          MAPREDUCE-5886. Allow wordcount example job to accept multiple input paths. Contributed by Chris Nauroth. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601704)

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/WordCount.java

            People

            • Assignee: Chris Nauroth
            • Reporter: Chris Nauroth
            • Votes: 0
            • Watchers: 6
