Hadoop Map/Reduce / MAPREDUCE-1466

FileInputFormat should save #input-files in JobConf

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.21.0
    • Component/s: client
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Added a private configuration variable, mapreduce.input.num.files, to store the number of input files being processed by an M/R job.

      Description

      We already track the amount of data consumed by MR applications (MAP_INPUT_BYTES). Along with that, it would be useful to track #input-files from the client side for analysis. Along the lines of MAPREDUCE-1403, it would be easy to stick this in the JobConf during job submission.
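
The idea in the description can be sketched roughly as follows. This is a JDK-only illustration, not the committed patch: a `Map<String, String>` stands in for the JobConf, `recordNumInputFiles` is a hypothetical helper name, and the key string is the one quoted in the release note. In a real job the count would presumably be written with something like `Configuration.setLong(key, count)` at the point where the input format lists its input files.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NumInputFilesSketch {
    // Key name as quoted in the release note; illustrative only.
    public static final String NUM_INPUT_FILES = "mapreduce.input.num.files";

    // Hypothetical helper: stores the number of listed input files in the
    // job configuration (a plain map here) for later offline analysis.
    public static void recordNumInputFiles(Map<String, String> conf, List<String> files) {
        conf.put(NUM_INPUT_FILES, Long.toString(files.size()));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        List<String> files = Arrays.asList("/data/part-00000", "/data/part-00001");
        recordNumInputFiles(conf, files);
        // Client-side tooling can read the value back from the submitted configuration.
        System.out.println(NUM_INPUT_FILES + " = " + conf.get(NUM_INPUT_FILES));
    }
}
```

Storing the count as a configuration value, rather than as a counter, means it is available on the client as soon as the job has been submitted.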

      1. mr-1466-trunk-v5.patch
        17 kB
        Luke Lu
      2. mr-1466-trunk-v4.patch
        17 kB
        Luke Lu
      3. mr-1466-trunk-v3.patch
        17 kB
        Luke Lu
      4. mr-1466-trunk-v2.patch
        13 kB
        Luke Lu
      5. mr-1466-trunk-v1.patch
        10 kB
        Luke Lu
      6. MAPREDUCE-1466_yhadoop20-3.patch
        9 kB
        Hemanth Yamijala
      7. MAPREDUCE-1466_yhadoop20-2.patch
        9 kB
        Hemanth Yamijala
      8. MAPREDUCE-1466_yhadoop20-1.patch
        9 kB
        Hemanth Yamijala
      9. MAPREDUCE-1466_yhadoop20.patch
        9 kB
        Arun C Murthy

        Activity

        Arun C Murthy created issue -
        Arun C Murthy added a comment -

        Straightforward patch for yhadoop20 (not to be committed), along with a test case.

        Arun C Murthy made changes -
        Field Original Value New Value
        Attachment MAPREDUCE-1466_yhadoop20.patch [ 12435527 ]
        Hemanth Yamijala added a comment -

        Minor changes to the earlier patch in the newly attached one:

        • Removed a System.err println in the old FileInputFormat. Please note that the same data (about number of paths to process) is available via a log statement in getSplits as well.
        • Removed a duplicate call to listStatus in the new FileInputFormat, which was like this:
          +    List<FileStatus>files = listStatus(job);
               for (FileStatus file: listStatus(job)) {
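
The duplicate-call problem quoted above can be illustrated with a small JDK-only sketch. The names here (`ListOnceSketch`, the string paths, the `listCalls` counter) are hypothetical stand-ins, and the loop body is invented for illustration. The point is only that the result of the listing is kept in a local variable and the loop iterates over that variable rather than listing again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListOnceSketch {
    // Counts how often the (potentially expensive) listing runs.
    public static int listCalls = 0;

    // Hypothetical stand-in for listStatus(job): returns the input paths.
    public static List<String> listStatus() {
        listCalls++;
        return Arrays.asList("/in/a", "/in/b");
    }

    // Lists the inputs once and reuses the local variable in the loop,
    // instead of calling listStatus() a second time in the for header.
    public static List<String> getSplits() {
        List<String> files = listStatus();
        List<String> splits = new ArrayList<>();
        for (String file : files) {
            splits.add(file + ":split0");
        }
        return splits;
    }

    public static void main(String[] args) {
        int n = getSplits().size();
        System.out.println(n + " splits, " + listCalls + " listing call(s)");
    }
}
```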
          

        I also suppose we need test cases for the new API. However, there are no tests for any of the classes in the org.apache.hadoop.mapreduce.lib.input package, so possibly this should be a separate JIRA.

        Please let me know if the changes seem fine.

        Hemanth Yamijala made changes -
        Attachment MAPREDUCE-1466_yhadoop20-1.patch [ 12435808 ]
        Arun C Murthy added a comment -

        +1 (evils of copy-paste, sigh!), thanks for the review Hemanth.

        Hemanth Yamijala added a comment -

        Patch removing prefixes in file names generated due to incorrect diff command.

        Hemanth Yamijala made changes -
        Attachment MAPREDUCE-1466_yhadoop20-2.patch [ 12435948 ]
        Hemanth Yamijala made changes -
        Release Note Added a private configuration variable, mapreduce.input.num.files, to store the number of input files being processed by an M/R job.
        Hemanth Yamijala added a comment -

        There was a Findbugs warning in the previous patch, which is now corrected. For the record, Findbugs was complaining that the static field NUM_INPUT_FILES was not declared final when it should have been.

        Hemanth Yamijala made changes -
        Attachment MAPREDUCE-1466_yhadoop20-3.patch [ 12436886 ]
        Luke Lu added a comment -

        Ported to trunk

        Luke Lu made changes -
        Attachment mr-1466-trunk-v1.patch [ 12437051 ]
        Luke Lu made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Affects Version/s 0.22.0 [ 12314184 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12437051/mr-1466-trunk-v1.patch
        against trunk revision 916823.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/491/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/491/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/491/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/491/console

        This message is automatically generated.

        Chris Douglas added a comment -
        • Please retain javadoc for INPUT_NUM_FILES
        • Since INPUT_NUM_FILES will only be set for FileInputFormat, putting the constant there, instead of in JobContext, is probably correct.
        Chris Douglas made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Chris Douglas added a comment -

        Talked with Luke. Since INPUT_NUM_FILES is shared between both implementations of FileInputFormat, most config keys were moved to JobContext in MAPREDUCE-849 anyway, and config keys are undocumented in our current regime: the current patch is correct.

        +1

        Amareshwari Sriramadasu added a comment -

        After MAPREDUCE-849, all library-specific configuration parameters are present in their respective libraries, not in JobContext. For example, FileInputFormat.INPUT_DIR, SPLIT_MAXSIZE, and SPLIT_MINSIZE are configuration for FileInputFormat. So INPUT_NUM_FILES should also be in FileInputFormat, and its name should be something like mapreduce.input.fileinputformat.numinputfiles.

        Luke Lu added a comment -

        Incorporated Amareshwari's feedback, rebased against trunk.

        Luke Lu made changes -
        Attachment mr-1466-trunk-v2.patch [ 12439221 ]
        Luke Lu made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12439221/mr-1466-trunk-v2.patch
        against trunk revision 924991.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/38/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/38/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/38/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/38/console

        This message is automatically generated.

        Luke Lu made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Luke Lu added a comment -

        Resubmit patch to clear the nfs/jar problem.

        Luke Lu made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Chris Douglas made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Chris Douglas made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Amareshwari Sriramadasu added a comment -

        A couple of comments:

        1. The string constant NUM_INPUT_FILES need not be defined in org.apache.hadoop.mapred.FileInputFormat. It can use the one from org.apache.hadoop.mapreduce.lib.input.FileInputFormat
        2. Can you add the unit test in org.apache.hadoop.mapreduce.lib.input.TestFileInputFormat also?
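
The first point can be sketched as below. This is a JDK-only illustration with hypothetical nested classes standing in for the two FileInputFormat implementations; the key string is the name suggested earlier in this thread and is used here only as an example. The key is declared once as a `static final` constant (the form Findbugs asked for earlier in the thread), and the old-API stand-in reads it from the new-API stand-in instead of declaring a second literal.

```java
public class SharedKeySketch {
    // Stand-in for org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
    public static class NewApiFileInputFormat {
        // Declared final so the key cannot be reassigned at runtime.
        public static final String NUM_INPUT_FILES =
            "mapreduce.input.fileinputformat.numinputfiles";
    }

    // Stand-in for org.apache.hadoop.mapred.FileInputFormat.
    public static class OldApiFileInputFormat {
        // Uses the new-API constant rather than a second copy of the
        // string, so the two implementations cannot drift apart.
        public static String numInputFilesKey() {
            return NewApiFileInputFormat.NUM_INPUT_FILES;
        }
    }

    public static void main(String[] args) {
        System.out.println(OldApiFileInputFormat.numInputFilesKey());
    }
}
```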
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12439221/mr-1466-trunk-v2.patch
        against trunk revision 928104.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/63/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/63/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/63/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/63/console

        This message is automatically generated.

        Luke Lu added a comment -

        Good points, Amareshwari. v3 should address all those points.

        Luke Lu made changes -
        Attachment mr-1466-trunk-v3.patch [ 12440103 ]
        Luke Lu made changes -
        Assignee Arun C Murthy [ acmurthy ] Luke Lu [ vicaya ]
        Luke Lu made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Luke Lu made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Luke Lu added a comment -

        v4 removes some unused imports.

        Luke Lu made changes -
        Attachment mr-1466-trunk-v4.patch [ 12440105 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12440105/mr-1466-trunk-v4.patch
        against trunk revision 928104.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 7 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/68/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/68/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/68/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/68/console

        This message is automatically generated.

        Luke Lu added a comment -

        v5 added braces for an else and a comment. Note, this issue only covers straight FileInputFormat implementations and not CombineFileInputFormat and friends.

        Luke Lu made changes -
        Attachment mr-1466-trunk-v5.patch [ 12440263 ]
        Amareshwari Sriramadasu added a comment -

        I don't think the test in org.apache.hadoop.mapreduce.lib.input.TestFileInputFormat needs to use the Mockito framework. Why don't we add a test similar to the one in the old API?

        Chris Douglas added a comment -

        Aren't the tests verifying different properties? The test in mapreduce seems to be verifying the case where a zero-length file is the only entry returned...

        Amareshwari Sriramadasu added a comment -

        I don't think the test in org.apache.hadoop.mapreduce.lib.input.TestFileInputFormat needs to use the Mockito framework. Why don't we add a test similar to the one in the old API?

        I thought the test need not use Mockito, because it is not easy to understand. But it is fine with me.

        Chris Douglas added a comment -

        +1

        I committed this. Thanks, Luke & Arun!

        Chris Douglas made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Resolution Fixed [ 1 ]
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #302 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/302/)
        Record number of files processed in FileInputFormat in the Configuration for offline analysis. Contributed by Luke Lu and Arun Murthy

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #280 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/280/)

        Tom White made changes -
        Fix Version/s 0.21.0 [ 12314045 ]
        Fix Version/s 0.22.0 [ 12314184 ]
        Tom White made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Luke Lu
            Reporter:
            Arun C Murthy
          • Votes:
            0
            Watchers:
            5

            Dates

            • Created:
              Updated:
              Resolved:
