Hadoop Map/Reduce
MAPREDUCE-1170

MultipleInputs doesn't work with new API in 0.20 branch

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.20.2
    • Component/s: None
    • Labels: None

      Description

      This patch adds support for MultipleInputs (and KeyValueTextInputFormat) in o.a.h.mapreduce.lib.input, working with the new API. A passing unit test is included. Can this be included in 0.20.2?

        Activity

        Jay Booth created issue -
        Jay Booth made changes -
        Field Original Value New Value
        Attachment multipleInputs.patch [ 12423644 ]
        Jay Booth added a comment -

        Backported directly from the 0.21 branch. The only degradation is that, when splitting, KeyValueTextInputFormat doesn't recognize Bzip2Codec as splittable, because supporting that would have dragged in a bunch more classes.
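
        A rough sketch of the trade-off described above, using stand-in types rather than the real Hadoop classes. The class and method names here are hypothetical and the actual patch may do this differently; the point is only that a conservative check treats every compressed file as unsplittable, while a codec-aware check lets splittable codecs through.

```java
// Stand-in types for illustration only; these are NOT the real Hadoop classes.
class SplitabilityDemo {
    // Conservative check (assumed shape of the backport): any codec means "do not split".
    static boolean isSplitableConservative(CompressionCodec codec) {
        return codec == null;
    }

    // Codec-aware check: uncompressed files and splittable codecs can both be split.
    static boolean isSplitableCodecAware(CompressionCodec codec) {
        return codec == null || codec instanceof SplittableCompressionCodec;
    }

    public static void main(String[] args) {
        CompressionCodec bzip2Like = new Bzip2LikeCodec();
        System.out.println("conservative: " + isSplitableConservative(bzip2Like));
        System.out.println("codec-aware:  " + isSplitableCodecAware(bzip2Like));
    }
}

// Hypothetical marker interfaces modelling "a codec" and "a codec that can be split".
interface CompressionCodec {}
interface SplittableCompressionCodec extends CompressionCodec {}

// Hypothetical codecs: one splittable, one not.
class Bzip2LikeCodec implements SplittableCompressionCodec {}
class GzipLikeCodec implements CompressionCodec {}
```

        With the conservative check, a bzip2-compressed input is still read correctly, but by a single mapper per file instead of one mapper per split.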

        Jay Booth made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Jay Booth added a comment -

        Turns out the test only passes because it doesn't try to actually execute the job. It just uses MultipleInputs to add the inputs, then checks that they were added to the appropriate structures in memory.

        When you run an actual job using TextInputFormat, we get:

        java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit cannot be cast to org.apache.hadoop.mapreduce.lib.input.FileSplit
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:55)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:418)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:582)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)

        This probably affects 0.21 as well, based on my brief reading of the code. Any suggestions? It seems hard to work around without changing the signature of InputSplit, which would be pretty disruptive.

        One (very hacky) method that could be used would be to have LineRecordReader do something along the lines of
        if (split instanceof TaggedInputSplit) split = ((TaggedInputSplit)split).getInnerSplit()

        Any other ideas?
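
        To make the failure mode concrete, here is a minimal sketch with stand-in classes (not the real Hadoop types; only the class names and the wrapper relationship are taken from the stack trace above). It assumes the wrapper exposes its inner split through an accessor, called getInputSplit() here. The blind cast fails because the wrapper is a sibling of FileSplit rather than a subclass; unwrapping first avoids that.

```java
// Stand-in types for illustration only; these are NOT the real Hadoop classes.
class SplitUnwrapDemo {
    // Mirrors the failing pattern: cast whatever split arrives straight to FileSplit.
    static FileSplit blindCast(InputSplit split) {
        return (FileSplit) split;
    }

    // Mirrors the proposed workaround: unwrap a tagged split before casting.
    static FileSplit unwrapThenCast(InputSplit split) {
        if (split instanceof TaggedInputSplit) {
            split = ((TaggedInputSplit) split).getInputSplit();
        }
        return (FileSplit) split;
    }

    public static void main(String[] args) {
        InputSplit tagged = new TaggedInputSplit(new FileSplit("/data/part-00000"));
        try {
            blindCast(tagged);
        } catch (ClassCastException e) {
            System.out.println("blind cast failed with ClassCastException");
        }
        System.out.println("unwrapped path: " + unwrapThenCast(tagged).path);
    }
}

abstract class InputSplit {}

class FileSplit extends InputSplit {
    final String path;
    FileSplit(String path) { this.path = path; }
}

// Wraps another split; it extends InputSplit, not FileSplit, so it cannot be cast to FileSplit.
class TaggedInputSplit extends InputSplit {
    private final InputSplit inner;
    TaggedInputSplit(InputSplit inner) { this.inner = inner; }
    InputSplit getInputSplit() { return inner; }
}
```

        The drawback is the one already noted: the record reader has to know about the wrapper type, which couples two classes that otherwise have nothing to do with each other.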

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12423644/multipleInputs.patch
        against trunk revision 831037.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/114/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/114/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/114/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/114/console

        This message is automatically generated.

        Jay Booth added a comment -

        Cancelling patch until this fully works

        Jay Booth made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Jay Booth added a comment -

        New patch fixes the ClassCastException in LineRecordReader via:

        if (split instanceof TaggedInputSplit) fileSplit = (FileSplit) ((TaggedInputSplit) split).getInputSplit();
        else fileSplit = (FileSplit) split;

        The old test only added the inputs and verified that they were registered; it never actually ran a job, so this error slipped through.

        New test runs a job with MultipleInputs and 2 different mapper classes, ensuring that output is correct. Passes.

        The test fails on 0.21 branch though – I'll make a separate JIRA and post a patch for that as well

        Jay Booth made changes -
        Attachment MAPREDUCE-1170.patch [ 12423858 ]
        Jay Booth made changes -
        Attachment multipleInputs.patch [ 12423644 ]
        Jay Booth made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12423858/MAPREDUCE-1170.patch
        against trunk revision 831816.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/219/console

        This message is automatically generated.

        Jay Booth added a comment -

        The old patch was against an internal snapshot; this one is against hadoop/common/branches/branch-0.20/ and should work.

        Jay Booth made changes -
        Attachment MAPREDUCE-1170-apache.patch [ 12423868 ]
        Jay Booth made changes -
        Attachment MAPREDUCE-1170.patch [ 12423858 ]
        Chris Douglas added a comment -

        As in MAPREDUCE-1145, we cannot push new code into the 0.20 branch without testing and supporting it.

        Chris Douglas made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]
        Jay Booth added a comment -

        Better patch including Amareshwari's more maintainable fix from MAPREDUCE-1178.

        Not targeting for inclusion, just posting here in case any passersby want the patch.

        Jay Booth made changes -
        Attachment MAPREDUCE-1170-branch-20.patch [ 12426139 ]
        Jay Booth made changes -
        Attachment MAPREDUCE-1170-apache.patch [ 12423868 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            Jay Booth
          • Votes:
            0
            Watchers:
            2
