[MAPREDUCE-3247] Add hash aggregation style data flow and/or new API - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.23.0
Fix Version/s: None
Component/s: task
Labels:
- api
- performance

Description

In many join/aggregation-like queries run on top of MapReduce, sorting is not needed; a hash-table-based join/aggregation is more efficient. This is described in detail in "Tenzing: A SQL Implementation On The MapReduce Framework". There are two ways to support hash-table-based join/aggregation in Hadoop MapReduce:

Support only a no-sort mode: the framework does nothing beyond passing partitioned k/v pairs from mapper to reducer.
The upper-level application uses a hash table in its mapper and reducer to do the aggregation, and emits all hash table entries in cleanup() of the mapper/reducer. This is how Google did it in Tenzing. The main problem is controlling the memory used by the hash table.
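
As a rough illustration of this first option, the sketch below aggregates in a hash table during map() and emits every entry in cleanup(). It is plain Java with no Hadoop dependency: the class HashAggregatingMapper and its method signatures are hypothetical stand-ins that only mimic the map()/cleanup() lifecycle, and summing long values per key is an assumed example aggregation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical stand-in for a mapper; only the map()/cleanup() lifecycle is mimicked.
class HashAggregatingMapper {
    // Partial sums per key, held entirely in memory. Nothing bounds this table,
    // which is the memory-control problem described above.
    private final Map<String, Long> partialSums = new HashMap<>();

    // Called once per input record: aggregate into the table instead of emitting.
    void map(String key, long value) {
        partialSums.merge(key, value, Long::sum);
    }

    // Called once after the last record: emit every hash table entry, then release it.
    void cleanup(BiConsumer<String, Long> emit) {
        partialSums.forEach(emit);
        partialSums.clear();
    }
}
```

Because no output is produced until cleanup(), a task with many distinct keys keeps all of them in memory at once, so the application itself has to decide when to spill.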

Add a new "fold" API that can coexist with the combiner/reducer API; users can choose either mapper-combiner-reducer or "mapper-folder" (maybe a bad name; better names are welcome).
Like foldl in functional programming, the folder should have the semantics:
foldl folder z (x:xs) = foldl folder (folder z x) xs
This way, upper-level applications only need to provide the folder; the underlying framework creates and maintains the hash table of key/value pairs, so the table can be managed and optimized by the framework. For example, on the mapper side, when memory consumption reaches io.sort.mb, the framework can pre-emit the entire hash table, or use a policy such as a cache replacement algorithm to emit some of the k/v pairs and free memory.
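
One way the fold idea could look is sketched below, again in plain Java without Hadoop. The names Folder, FoldingBuffer, collect, and flush are hypothetical and not part of any existing API. The sketch assumes the accumulator and the values share one type so that pre-emitted partial results can be folded again downstream, and it uses an entry-count limit (maxEntries) as a simplified stand-in for a byte budget such as io.sort.mb. Only the "pre-emit the entire hash table" policy is shown.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical user-facing API: the application supplies only this.
interface Folder<V> {
    V zero();               // initial accumulator, the "z" in foldl
    V fold(V acc, V value); // one fold step, "folder z x"
}

// Hypothetical framework-side buffer that owns and manages the hash table.
class FoldingBuffer<K, V> {
    private final Folder<V> folder;
    private final int maxEntries; // simplified stand-in for a memory budget
    private final Map<K, V> table = new HashMap<>();
    // Records handed downstream; a real framework would write them to map output.
    final List<Map.Entry<K, V>> emitted = new ArrayList<>();

    FoldingBuffer(Folder<V> folder, int maxEntries) {
        this.folder = folder;
        this.maxEntries = maxEntries;
    }

    // Fold one k/v pair into the table, pre-emitting the whole table
    // when a new key would push it past the budget.
    void collect(K key, V value) {
        if (!table.containsKey(key) && table.size() >= maxEntries) {
            flush();
        }
        V acc = table.containsKey(key) ? table.get(key) : folder.zero();
        table.put(key, folder.fold(acc, value));
    }

    // Emit every entry as a partial aggregate and free the table.
    void flush() {
        for (Map.Entry<K, V> e : table.entrySet()) {
            emitted.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        table.clear();
    }
}
```

With a summing folder and maxEntries = 2, collecting (a,1), (b,2), (a,3), (c,4) pre-emits the partial sums for a and b when c arrives; the same key may therefore reach the reduce side more than once and has to be folded again there.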

Issue Links

is related to

MAPREDUCE-3246 Make Task extensible to support modifications of Task or even alternate programming paradigms

Open

MAPREDUCE-2841 Task level native optimization

Resolved

Sub-Tasks

Support no sort dataflow in map output and reduce merge phase

Open

Binglin Chang

People

Assignee: Unassigned

Reporter: Binglin Chang

Votes: 1

Watchers: 29

Dates

Created: 22/Oct/11 12:44

Updated: 11/Dec/12 01:59