Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-3247

Add hash aggregation style data flow and/or new API

    Details

    • Type: New Feature New Feature
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.23.0
    • Fix Version/s: None
    • Component/s: task
    • Labels:

      Description

      In many join/aggregation like queries run on top of mapreduce, sort is not need, in fact a hash table based join/aggregation is more efficient, this is described in "Tenzing A SQL Implementation On The MapReduce Framework" in detail. There are two ways to support hash table based join/aggregation in hadoop mapreduce:

      1. Only support no sort, the framework do nothing, just pass partitioned k/v pair from mapper to reducer
        The upper application use hash table in their mapper & reducer to do aggregation, and emit all hashtable enties in cleanup() of mapper/reducer, this is how Google did in Tenzing. The main problem is memory control of hashtable.
      1. Add new "fold" API, it can coexist with combiner/reducer API, user can use mapper-combiner-reducer or "mapper-folder" (maybe a bad name, welcome to propose a better name..)
        Like foldl in functional programming: folder should have the semantic:
        foldl folder z (x:xs) = foldl folder (folder z x) xs
        In this way, upper applications only need to provide folder, underlying framework create and maintains hashtable for key/value pairs, it can be managed & optimized by the framework. For example, in mapper side, we can pre emit entire hashtable or use some policies like cache algorithm to emit part of k/v pairs to free some memory, if the memory consumption reach io.sort.mb

        Issue Links

          Activity

          Hide
          Jerry Chen added a comment -

          Binglin, I noticed that you create this bug from MAPREDUCE-1639, while I think this two bugs are more or less similar. And also there are a lot other things related are going on such as MAPREDUCE-2454 and MAPREDUCE-4049.

          If you are not working on this, I would like to take time to work on this feature.

          Show
          Jerry Chen added a comment - Binglin, I noticed that you create this bug from MAPREDUCE-1639 , while I think this two bugs are more or less similar. And also there are a lot other things related are going on such as MAPREDUCE-2454 and MAPREDUCE-4049 . If you are not working on this, I would like to take time to work on this feature.
          Binglin Chang made changes -
          Link This issue is related to MAPREDUCE-2841 [ MAPREDUCE-2841 ]
          Binglin Chang made changes -
          Field Original Value New Value
          Link This issue is related to MAPREDUCE-3246 [ MAPREDUCE-3246 ]
          Binglin Chang created issue -

            People

            • Assignee:
              Unassigned
              Reporter:
              Binglin Chang
            • Votes:
              1 Vote for this issue
              Watchers:
              25 Start watching this issue

              Dates

              • Created:
                Updated:

                Development