  1. Hadoop Common
  2. HADOOP-1120

Contribute some code helping implement map/reduce apps for joining data from multiple sources


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels: None

    Description

      With the current Hadoop, it is a bit hard for the user to implement data joining apps.
      HADOOP-475/485 attempt to provide some support for data joining jobs, but that support seems to be hard to implement.

      This Jira instead calls for application-level support.
      The idea is to provide generic map/reduce classes that implement data join jobs
      and allow the user to extend those classes to add their own logic.

      In particular, the user needs to define a mapper class
      that extends DataJoinMapperBase class to implement methods for the
      following functionalities:

      1. Compute the source tag of input values
      2. Compute the map output value object
      3. Compute the map output key object

      The source tag will be used by the reducer to determine from which source
      (which table in SQL terminology) a value comes. Computing the map output
      value object amounts to performing projecting/filtering work in a SQL
      statement (through the select/where clauses). Computing the map output key
      amounts to choosing the join key. This class provides the appropriate plugin
      points for the user defined subclasses to implement the appropriate logic.
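
      The three mapper responsibilities above can be illustrated with a small,
      Hadoop-free sketch. Everything in it is an assumption made for illustration:
      the class name JoinMapperSketch, the Tagged value type, the comma-separated
      record layout, and the choice of the input directory name as the source tag.
      It does not reproduce the actual DataJoinMapperBase API from the patch.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class JoinMapperSketch {

    /** Hypothetical stand-in for a tagged map output value: source tag plus projected data. */
    public static final class Tagged {
        public final String tag;
        public final String data;

        public Tagged(String tag, String data) {
            this.tag = tag;
            this.data = data;
        }
    }

    /** 1. Source tag: assumed here to be the directory part of the input file path. */
    public static String sourceTag(String inputFile) {
        int slash = inputFile.lastIndexOf('/');
        return slash < 0 ? inputFile : inputFile.substring(0, slash);
    }

    /** 2. Output value: keep only the second field (projection); drop records with an empty key (filter). */
    public static Tagged outputValue(String tag, String line) {
        String[] fields = line.split(",", -1);
        if (fields.length < 2 || fields[0].isEmpty()) {
            return null; // record filtered out, like a WHERE clause
        }
        return new Tagged(tag, fields[1]);
    }

    /** 3. Output key: the join key, assumed to be the first field. */
    public static String joinKey(String line) {
        return line.split(",", -1)[0];
    }

    /** Combines the three steps into one (key, tagged value) pair, or null if the record is dropped. */
    public static Map.Entry<String, Tagged> map(String inputFile, String line) {
        String tag = sourceTag(inputFile);
        Tagged value = outputValue(tag, line);
        if (value == null) {
            return null;
        }
        return new SimpleEntry<>(joinKey(line), value);
    }

    public static void main(String[] args) {
        Map.Entry<String, Tagged> pair = map("orders/part-00000", "42,widget,3");
        System.out.println(pair.getKey() + " -> [" + pair.getValue().tag + "] " + pair.getValue().data);
    }
}
```

      In this sketch the tag travels with the value, so a reducer that receives all
      values for one join key can still tell which source each value came from.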

      Then the user needs to define a reducer class
      that extends DataJoinReduceBase class to implement the following:

      protected abstract TaggedMapOutput combine(Object[] tags, Object[] values);

      The above method is expected to produce one output value from an array of
      records of different sources. The user code can also perform filtering here.
      It can return null if it decides that the records do not meet certain conditions.
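
      A minimal sketch of how a subclass might implement combine is shown below.
      The abstract signature follows the one quoted above; the names
      JoinReducerSketch, ReducerBase, TaggedOutput, InnerJoin, and join are
      hypothetical stand-ins so the example runs without Hadoop. It also assumes
      that the framework passes one record per source in each call.

```java
public class JoinReducerSketch {

    /** Hypothetical stand-in for TaggedMapOutput: a tag plus the joined data. */
    public static final class TaggedOutput {
        public final String tag;
        public final String data;

        public TaggedOutput(String tag, String data) {
            this.tag = tag;
            this.data = data;
        }
    }

    /** Hypothetical stand-in for the reducer base class; only the plugin point is modeled. */
    public abstract static class ReducerBase {
        protected abstract TaggedOutput combine(Object[] tags, Object[] values);
    }

    /** Inner-join style combiner: emits a row only when at least two sources contributed. */
    public static final class InnerJoin extends ReducerBase {
        @Override
        protected TaggedOutput combine(Object[] tags, Object[] values) {
            if (tags.length < 2) {
                return null; // filtering: the key is missing from one of the sources
            }
            StringBuilder joined = new StringBuilder();
            for (int i = 0; i < values.length; i++) {
                if (i > 0) {
                    joined.append(',');
                }
                joined.append(values[i]);
            }
            return new TaggedOutput("joined", joined.toString());
        }
    }

    /** Convenience wrapper: returns the joined data, or null if the records were filtered out. */
    public static String join(Object[] tags, Object[] values) {
        TaggedOutput out = new InnerJoin().combine(tags, values);
        return out == null ? null : out.data;
    }

    public static void main(String[] args) {
        System.out.println(join(new Object[] {"customers", "orders"}, new Object[] {"alice", "widget"}));
        System.out.println(join(new Object[] {"customers"}, new Object[] {"alice"}));
    }
}
```

      Returning null for a key seen in only one source gives inner-join behavior;
      a subclass that wanted outer-join behavior could instead emit a row with a
      placeholder for the missing side.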

      That is pretty much all the user needs to do in order to create a map/reduce job to join data
      from different sources.

      Attachments

        1. data_join.patch
          34 kB
          Runping Qi


          People

            Assignee: Runping Qi
            Reporter: Runping Qi
