[MAPREDUCE-3397] Support no sort dataflow in map output and reduce merge phrase - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.20.205.0
Fix Version/s: None
Component/s: task
Labels:
None

Description

In our experience, many data aggregation style queries/jobs don't need to sort the intermediate data. In fact reducer side can use hashmap or even array to do application level aggregations. For example, consider computing CTR using display log & click log in sponsored search. Map side just emit (adv_id, clk_cnt, dis_cnt), reduce side aggregate clk_cnt and dis_cnt for every adv_id, cause adv_id is integer, we can partition adv_id by range:

- reduce0: 0-100000
- reduce1: 100000-200000
- ...
- reduceM: xxx-max adv-id
  Then the reducer can use an array(for example: int [1000000][2]) to store the aggregated clk_cnt & dis_cnt, and we don't need the framework to sort intermediate data anymore.
  By supporting no sort, we can gain a lot of performance improvements:

Eliminate map side sort & merge.
KV paris need to sort by partition first, but this can be done using a liner time counting sort, which is much faster than quick sort.
Just merge spill segments one by one, doesn't need to use heap merge.
Eliminate shuffle phrase barrier, reducer can start to processing data before all map output data are copied & merged.

For most cases, memory won't be a problem, cause keys are divided to many partitions, each reducers only process a small subset of the global key set.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

MAPREDUCE-3397-nosort.v1.patch
14/Nov/11 10:12
37 kB
Binglin Chang

Activity

People

Assignee:: Binglin Chang

Reporter:: Binglin Chang

Votes:: 0 Vote for this issue

Watchers:: 20 Start watching this issue

Dates

Created:: 14/Nov/11 10:10

Updated:: 12/Jan/12 17:56