Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Most applications of map-reduce care about grouping, not sorting. Sorting is a (relatively expensive) way to achieve grouping. To achieve just grouping, one can:

      • replace the sort on the Mappers with a hashtable - and maintain lists of key-values against each hash bucket.
      • sort the key-value tuples inside each hash bucket before spilling or sending to the Reducer. Any time this is done, the Combiner can be invoked.
      • serialize the hashtable by hash-bucket id, so merges (of either spills or Map outputs) work much as they do today (at least there's no change in the overall compute complexity of the merge).

      Of course this hashtable has nothing to do with partitioning. It's just a replacement for the map-side sort.
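      The bucketed collection described above can be sketched in a few lines. This is a toy illustration of the idea, not Hadoop code - class and constant names are made up. Sorting B buckets of roughly n/B keys each costs O(n log(n/B)) comparisons instead of O(n log n) for one global sort, while equal keys still end up adjacent (all that grouping needs).

```java
import java.util.*;

// Toy sketch: group map-output keys into hash buckets, then sort only
// within each bucket before spilling or invoking the combiner.
public class HashGroupSketch {
    static final int NUM_BUCKETS = 4;  // real implementations would use many more

    public static List<List<String>> collect(List<String> keys) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < NUM_BUCKETS; i++) buckets.add(new ArrayList<>());
        for (String k : keys) {
            // the bucket id is independent of the reducer partition
            int b = (k.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
            buckets.get(b).add(k);
        }
        // sort each bucket on its own; a combiner could run per bucket here
        for (List<String> bucket : buckets) Collections.sort(bucket);
        return buckets;
    }

    public static void main(String[] args) {
        // equal keys always hash to the same bucket, so after the
        // per-bucket sorts they sit next to each other
        System.out.println(collect(Arrays.asList("b", "a", "b", "c", "a")));
    }
}
```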

      This is (pretty much) straight from the MARS project paper: http://www.cse.ust.hk/catalac/papers/mars_pact08.pdf. They report a 45% speedup in inverted-index calculation using hashing instead of sorting (the reference implementation is NOT against Hadoop, though).

        Issue Links

          Activity

          Owen O'Malley added a comment -

          Three comments spring to mind:
          1. A completely new data path is very expensive.
          2. A much easier way to achieve their result is to include the hash in the key as the first 4 bytes of the serialization, then have a comparator that uses the first 4 bytes as memcmp before it uses the dispatched raw compare method.
          3. Having a library of memcmp'able keys would be even better. Basically, you'd need to implement org.apache.hadoop.io.memcmp.(Int|Long|String|Array) classes that implement Writable and a marker interface.
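          Owen's suggestion in point 2 can be sketched as follows. This is illustrative code, not a Hadoop API - the class and method names are made up. Note the byte-wise ordering of the hash prefix is arbitrary (signed ints don't memcmp in numeric order), but for grouping all that matters is that the ordering is total and that equal keys compare equal.

```java
import java.nio.ByteBuffer;

// Sketch: serialize each key with its 4-byte hash as a prefix, so a raw
// comparator can usually decide with a cheap 4-byte memcmp and only falls
// back to the real key comparison when the hash prefixes are equal.
public class HashPrefixedKey {
    public static byte[] serialize(String key) {
        byte[] body = key.getBytes();
        return ByteBuffer.allocate(4 + body.length)
                .putInt(key.hashCode())   // 4-byte hash prefix
                .put(body)
                .array();
    }

    public static int compareRaw(byte[] a, byte[] b) {
        // memcmp on the 4-byte hash prefix
        for (int i = 0; i < 4; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        // hashes equal: dispatch to the real key comparison
        return new String(a, 4, a.length - 4)
                .compareTo(new String(b, 4, b.length - 4));
    }
}
```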

          Joydeep Sen Sarma added a comment -

          I realize this is non-trivial.

          > 2. A much easier way to achieve their result is to include the hash in the key as the first 4 bytes of the serialization, then have a comparator that uses the first 4 bytes as memcmp before it uses the dispatched raw compare method.

          Not sure about this. It doesn't change the fact that all keys are being compared against each other. Just because we compare fewer bytes doesn't mean that the number of comparisons goes down - it's still n log n; only the cost of each comparison is lower.

          If we did a radix sort (first) on the leading 4 (hashed) bytes - then yeah, it changes things. But at that point we are taking (pretty much) the same approach as the paper and talking about a different data path.

          Joydeep Sen Sarma added a comment -

          Following up on the previous comment - reading the paper again closely, it's no longer clear to me whether the paper actually implements what I filed or whether it's closer to how Owen interpreted it. But I think it's true that sorting only within hash buckets is fundamentally lower complexity than sorting everything together.

          I am also wondering how different this is from the current data path. Today we already bucket map outputs into one bucket per reducer and sort therein. This is simply saying that we have lots of buckets per reducer (not one) - we are still sorting inside each of those, but because we are sorting much smaller sets, the complexity is lower.

          So on the mapper side we can (sort of) just pretend that the number of reducers is much, much higher - except that when it's time to send the data off to the reducer (or spill), we send a bunch of buckets. But this is not transparent to the merge code - it now has to merge taking the bucket id into consideration. If we prefixed each map-output key with the bucket id, then it would be transparent to the merge code. So some changes are definitely required - but maybe tractable?
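          The bucket-id-prefix idea can be sketched as follows. This is toy code, not Hadoop's merge implementation; the helper names are hypothetical. If every map-output key carries a fixed-width bucket-id prefix, already-sorted spills merge with entirely ordinary merge logic - ordering by (bucket, key) falls out of plain string comparison, and equal keys stay adjacent.

```java
import java.util.*;

public class BucketPrefixMergeSketch {
    static final int NUM_BUCKETS = 16;

    // hypothetical helper: prepend a fixed-width bucket id to the key so
    // lexicographic order on the prefixed form is (bucketId, key) order
    public static String prefixed(String key) {
        int bucket = (key.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
        return String.format("%02d|%s", bucket, key);
    }

    // a plain 2-way merge of two sorted spills; nothing here is bucket-aware
    public static List<String> merge(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            out.add(a.get(i).compareTo(b.get(j)) <= 0 ? a.get(i++) : b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }
}
```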

          dhruba borthakur added a comment -

          So, then, are you saying that this is equivalent to just increasing the number of reducers, and making each reducer receive the records from one bucket?

          Joydeep Sen Sarma added a comment -

          No - the number of reducers is determined by other criteria (people here are expert at determining how to come up with that). We obviously can't have millions of reducers, but we can have millions of hash buckets that are sent in (large) groups to a much, much smaller number of reducers.

          I explained it in terms of a larger number of reducers to explore similarities (if any) to the current data path.

          Luke Lu added a comment -

          IMHO, it'd be easier to implement such grouping using a special group record reader (InputFormat) and a mapper that uses a hash map to group things and walks the map to emit grouped key-value pairs (where values can be counts or lists) at the end of the map method, rather than rearchitecting the current data path.

          Joydeep Sen Sarma added a comment -

          @Luke - I like this line of thinking. I am familiar with Hive - it already does map-side partial aggregates for (most) group-bys and emits them (what you mentioned). However, for (most) joins and cluster-by it depends on the sorting provided by the map-reduce framework. We could alter these query plans to execute this algorithm. The downside is pretty obvious: the benefits would only be available to Hive users (even as the cost of implementation remains fairly high). It would not be reusable by Pig, for example, nor by streaming users, etc.

          Extrapolating your idea a bit - we could try to implement this as a library: something that intercepts the output of the mapper (and combiner) and the input to the reduce function, and that could be inserted into the data path via config settings. (There's already a configurable output collector interface on the map output side; not sure about reducer input.) We could turn off map-side sorting in the map-reduce framework itself. The main concern I would have is that it seems too hard for these output collectors to implement spills (/merges) to (/from) disk. Perhaps if that functionality can be extracted out of the map-reduce core and provided as a library that these output collectors can use, then it's all feasible.

          Arun C Murthy added a comment -

          This is a great candidate for MR2.

          A new pipeline would be the most efficient, though:

          The output collector would hash rather than sort and spill in order of keys, thus keeping the combiner optional.

          The twist is that you wouldn't do a 2nd or 3rd or n-th level merge in the map. Just ship the segments out and get the reduce to think there are more segments than #maps (an additional index at the top). Most of the time each map output fits in the memory of the reduce, so you wouldn't seek any more than today. The 2+ level merges don't change in the reduce.

          Thoughts?

          He Yongqiang added a comment -

          Something that may help generate more discussion:

          We are trying to experiment with a new output collector. Here are some of our thoughts:
          1) Group key/value pairs into a memory block, which is basically a start-end pointer range into the big kvbuffer. One memory block must belong to exactly one reducer.
          2) Use quicksort to sort the data within each memory block.
          3) Use binary-insertion sort when spilling. Since memory blocks are already grouped by reducer in this phase, this will not sort memory blocks across reducers.

          For group-by and join, if sorting is not enforced, only grouping is needed - no global sort.
          On the mapper side:
          Keep a hashtable of memory blocks, and use the hash to decide which memory block a record goes to. This only helps reduce the number of sorts across memory blocks; it will not eliminate them, because of memory constraints (in which case we need to borrow memory from another memory block) and collisions.

          On the reduce side:
          All mappers apply the same rule, so we can add some metadata to each mapper's output to help the reduce side decide whether or not it needs to compare.

          He Yongqiang added a comment -

          With 1), 2), and 3), in some test cases we are seeing a 20%-40% CPU saving on the mapper side, and it helps a lot in reducing the CPU used for the in-memory sort. But we are definitely doing more tests on this.

          Todd Lipcon added a comment -

          Would be curious to see what the CPU impact of introducing a faster raw comparator would be. See HBASE-4012 for one optimization that would be easy to try out.

          He Yongqiang added a comment -

          Yeah, actually we tried that with the native comparator code from Baidu (see the patches for HCE), and the difference is not very big; sometimes it is worse (maybe because of some JNI cost, but we haven't looked into it).

          Todd Lipcon added a comment -

          The JNI cost makes sense, but the linked HBase JIRA doesn't use JNI. It uses sun.misc.Unsafe calls, which are actually JVM intrinsics (i.e. they get compiled directly into assembly, rather than going through the whole calling-convention + safepoint shenanigans that JNI does).
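          The core trick behind that comparator is comparing 8 bytes per step instead of 1. A minimal sketch of the word-at-a-time idea, shown here with ByteBuffer for portability rather than the Unsafe intrinsics the HBase JIRA uses (the class name is made up):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Word-at-a-time lexicographic comparison of byte arrays: compare longs
// while possible, then finish the tail byte by byte.
public class WordCompare {
    public static int compare(byte[] a, byte[] b) {
        int minLen = Math.min(a.length, b.length);
        // big-endian long reads preserve unsigned byte-wise (memcmp) order
        ByteBuffer ba = ByteBuffer.wrap(a).order(ByteOrder.BIG_ENDIAN);
        ByteBuffer bb = ByteBuffer.wrap(b).order(ByteOrder.BIG_ENDIAN);
        int i = 0;
        for (; i + 8 <= minLen; i += 8) {
            long la = ba.getLong(i), lb = bb.getLong(i);
            if (la != lb) return Long.compareUnsigned(la, lb) < 0 ? -1 : 1;
        }
        for (; i < minLen; i++) {                 // tail: byte at a time
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;               // shorter array sorts first
    }
}
```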

          Binglin Chang added a comment -

          > we are trying to experiment on a new output collector. Here are some of our thoughts:

          Nice work - I'm very interested in exactly how it is done.
          Actually I'm considering further optimizations: not just grouping, but doing "foldl"-style aggregation operations directly in a hashtable-like data structure at the map output collector stage and on the reducer side.
          It seems Hyracks already does that, and Google mentioned this in the paper "Tenzing: A SQL Implementation On The MapReduce Framework".

          Todd Lipcon added a comment -

          Just added MAPREDUCE-3235 and HADOOP-7761 as related JIRAs. The combination of those got a 40% CPU speedup on terasort in my tests.

          He Yongqiang added a comment -

          @Binglin, cool. Can you generate a patch based on the Facebook Hadoop GitHub repository when you are done?

          He Yongqiang added a comment -

          @Binglin, we will first try to deploy the code internally, and then will try to push the code to the fb Hadoop GitHub (or send it to you offline when it is almost done), and maybe you can do more improvements on it.

          Binglin Chang added a comment -

          @YongQiang
          Thanks!
          Originally I planned to support grouping in nativetask, but after optimizing the sort I found the sort phase is no longer a bottleneck, especially in large jobs with many reduce tasks. In fact the current sort implementation is not optimized at all (it just uses std::sort), so replacing sort with grouping may not have a big effect once the sort is optimized. Above all, I prefer hash aggregation; the bad thing about it is that it must change the combiner/reducer API. This won't be a problem for the under-development nativetask, but it is much more complicated in Java - I will create an issue for discussion.
          I'd prefer it sent to me offline - can't wait to see it.

          Binglin Chang added a comment -

          I created MAPREDUCE-3247 for the discussion about hashtable based join/aggregation.

          alex gemini added a comment -

          Spark and Shark also claimed a hash-based reduce faster than Hadoop's sort in a Hadoop Summit 2012 presentation.

          alex gemini added a comment -

          I guess the main point is that we need a per-chunk comparison instead of a per-record comparison, whether it is based on hashing (as this JIRA suggests) or on minor ranges (like Google Tenzing's block shuffle).

          Jerry Chen added a comment -

          +1. I think this feature is valuable and I will take time to work on it. The hash-based algorithm can be used both for group-by and for join; neither requires a global sort.

          Jerry Chen added a comment -

          MAPREDUCE-2454 deals with a new feature that supports a pluggable sort algorithm. I think this feature would be better based on that piece of work.


            People

            • Assignee:
              Unassigned
            • Reporter:
              Joydeep Sen Sarma
            • Votes:
              1
            • Watchers:
              43

              Dates

              • Created:
                Updated:

                Development