Hadoop Common / HADOOP-3702

Add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      all

    • Hadoop Flags:
      Reviewed
    • Release Note:
      Introduced ChainMapper and the ChainReducer classes to allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ REDUCE MAP*.

      Description

      We usually need to run multiple Maps one after the other on the same input, with no Reduce in between. We also have to run multiple Maps after the Reduce.

      If all pre-Reduce Maps are chained together and run as a single Map, a significant amount of disk I/O is avoided, since the Maps no longer run as separate jobs.

      Similarly, all post-Reduce Maps can be chained together and run in the Reduce phase, after the Reduce.
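
      As an illustration only, the idea of running several Maps as one can be sketched in plain Java with no Hadoop dependency. The Collector and RecordMapper interfaces and the chain helper below are hypothetical stand-ins, not the ChainMapper or ChainReducer API. Each Map writes to a collector that hands the record directly to the next Map in memory, so nothing is written to disk between the stages.

```java
import java.util.ArrayList;
import java.util.List;

class ChainSketch {

    /** Hypothetical stand-in for an output collector: receives emitted records. */
    interface Collector {
        void collect(String key, String value);
    }

    /** Hypothetical stand-in for a Mapper working on String keys and values. */
    interface RecordMapper {
        void map(String key, String value, Collector out);
    }

    /**
     * Returns a collector that pushes each record through mappers[index..]
     * in order and delivers the final records to sink.
     */
    static Collector chain(List<RecordMapper> mappers, int index, Collector sink) {
        if (index == mappers.size()) {
            return sink; // end of the chain: emit to the real output
        }
        RecordMapper current = mappers.get(index);
        Collector downstream = chain(mappers, index + 1, sink);
        // Whatever is collected here becomes the input of the current Map,
        // whose own output goes to the rest of the chain.
        return (key, value) -> current.map(key, value, downstream);
    }

    /** Runs a two-Map chain (tokenize, then upper-case the key) on one line. */
    static List<String> demo() {
        RecordMapper tokenize = (key, line, out) -> {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    out.collect(word, "1");
                }
            }
        };
        RecordMapper upperCase = (key, value, out) -> out.collect(key.toUpperCase(), value);

        List<String> results = new ArrayList<>();
        Collector head = chain(List.of(tokenize, upperCase), 0,
                (key, value) -> results.add(key + "=" + value));
        head.collect("0", "hello world"); // one input record enters the chain
        return results;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [HELLO=1, WORLD=1]
    }
}
```

      Neither Map in this sketch knows that it is part of a chain: each one only sees a collector to write to.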

      Attachments

      1. patch3702.txt (30 kB, Alejandro Abdelnur)
      2. patch3702.txt (30 kB, Alejandro Abdelnur)
      3. patch3702.txt (33 kB, Alejandro Abdelnur)
      4. patch3702.txt (45 kB, Alejandro Abdelnur)
      5. patch3702.txt (47 kB, Alejandro Abdelnur)
      6. patch3702.txt (53 kB, Alejandro Abdelnur)
      7. patch3702.txt (54 kB, Alejandro Abdelnur)
      8. patch3702.txt (56 kB, Alejandro Abdelnur)
      9. patch3702.txt (57 kB, Alejandro Abdelnur)
      10. patch3702.txt (58 kB, Alejandro Abdelnur)
      11. patch3702.txt (59 kB, Alejandro Abdelnur)
      12. patch3702.txt (59 kB, Alejandro Abdelnur)
      13. patch3702.txt (60 kB, Alejandro Abdelnur)
      14. Hadoop-3702.patch (60 kB, Enis Soztutar)

        Issue Links

          Activity

          Owen O'Malley made changes -
          Component/s mapred [ 12310690 ]
          Nigel Daley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Robert Chansler made changes -
          Release Note
            Original:
              The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is reduction in disk IO as many Maps can be clubbed together in a single job.
            New:
              Introduced ChainMapper and the ChainReducer classes to allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ REDUCE MAP*.
          Devaraj Das made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hadoop Flags [Reviewed]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12389651 ]
          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Fix Version/s 0.19.0 [ 12313211 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12389247 ]
          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12389067 ]
          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Enis Soztutar made changes -
          Link This issue incorporates HADOOP-3927 [ HADOOP-3927 ]
          Enis Soztutar made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Enis Soztutar made changes -
          Attachment Hadoop-3702.patch [ 12388256 ]
          Enis Soztutar made changes -
          Priority Minor [ 4 ] Major [ 3 ]
          Enis Soztutar made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Alejandro Abdelnur made changes -
          Hadoop Flags [Incompatible change]
          Release Note
            Original:
              The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is reduction in disk IO as many Maps can be clubbed together in a single job.

              The Configuration write(OutputStream) method has been renamed to writeXml(OutputStream) to avoid ambiguity with the Writable write(DataOutput) method.
            New:
              The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is reduction in disk IO as many Maps can be clubbed together in a single job.
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12387102 ]
          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Alejandro Abdelnur made changes -
          Release Note
            Original:
              The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is reduction in disk IO as many Maps can be clubbed together in a single job.
            New:
              The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is reduction in disk IO as many Maps can be clubbed together in a single job.

              The Configuration write(OutputStream) method has been renamed to writeXml(OutputStream) to avoid ambiguity with the Writable write(DataOutput) method.
          Hadoop Flags [Incompatible change]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386850 ]
          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386782 ]
          Alejandro Abdelnur made changes -
          Assignee Christophe Taton [ kryzthov ] Alejandro Abdelnur [ tucu00 ]
          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur made changes -
          Assignee Alejandro Abdelnur [ tucu00 ] Christophe Taton [ kryzthov ]
          Status Open [ 1 ] Patch Available [ 10002 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386720 ]
          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386713 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386712 ]
          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386712 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Release Note
            Original:
              The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*.
              An immediate benefit of this pattern is reduction in disk IO as many Maps can be clubbed together in a single job.
            New:
              The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is reduction in disk IO as many Maps can be clubbed together in a single job.
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386208 ]
          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386160 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12385965 ]
          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12385953 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12385952 ]
          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12385952 ]
          Alejandro Abdelnur made changes -
          Release Note The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*.
          An immediate benefit of this pattern is reduction in disk IO as many Maps can be clubbed together in a single job.
          Status Open [ 1 ] Patch Available [ 10002 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12385705 ]
          Alejandro Abdelnur made changes -
          Field Original Value New Value
          Description
            Original:
              We usually need to run multiple Maps one after the other on the same input, with no Reduce in between. We also have to run multiple Maps after the Reduce.

              If all pre-Reduce Maps are chained together and run as a single Map, a significant amount of disk I/O will be avoided.

              Similarly, all post-Reduce Maps can be chained together and run in the Reduce phase, after the Reduce.

              This could be done with ChainMapper and ChainReducer classes that manage the chain of Maps and override the OutputCollector to implement the chaining.

              The Maps and the Reduce that are part of the chain are unaware that they are executed in a chain; they receive records via the {{map}} and {{reduce}} methods and emit their output via the {{OutputCollector}}.

              The API would look something like:

              {code:java}
              public class ChainMapper implements Mapper {

                // Adds a Mapper to the chain; mapperConf is injected into a copy of the job's configuration.
                public static void addMapper(JobConf job, Class<? extends Mapper> klass, Properties mapperConf);
                ...
              }

              public class ChainReducer implements Reducer {

                // Sets the Reducer of the chain; reducerConf is injected into a copy of the job's configuration.
                public static void setReducer(JobConf job, Class<? extends Reducer> klass, Properties reducerConf);

                // Adds a Mapper to run after the Reducer.
                public static void addMapper(JobConf job, Class<? extends Mapper> klass, Properties mapperConf);
                ...
              }
              {code}

              The {{Properties}} configuration passed with a {{Mapper}} or {{Reducer}} when it is set into the chain is injected into a copy of the job's configuration. This allows Maps to be configured as usual without being aware that they are in a chain.
            New:
              We usually need to run multiple Maps one after the other on the same input, with no Reduce in between. We also have to run multiple Maps after the Reduce.

              If all pre-Reduce Maps are chained together and run as a single Map, a significant amount of disk I/O will be avoided.

              Similarly, all post-Reduce Maps can be chained together and run in the Reduce phase, after the Reduce.
          Alejandro Abdelnur created issue -

            People

            • Assignee: Alejandro Abdelnur
            • Reporter: Alejandro Abdelnur
            • Votes: 0
            • Watchers: 5
