[MAPREDUCE-2841] Task level native optimization - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.0.0-alpha1
Component/s: task
Labels: None
Environment: x86-64 Linux/Unix
Hadoop Flags: Reviewed
Release Note:

Adds a native implementation of the map output collector. The native library will build automatically with -Pnative. Users may choose the new collector on a job-by-job basis by setting mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator in their job configuration. For shuffle-intensive jobs this may provide speed-ups of 30% or more.
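
As a rough illustration of the per-job setting described in the release note, here is a minimal sketch. It uses java.util.Properties as a stand-in for the job configuration so that it runs without Hadoop on the classpath; the class name NativeCollectorConfigSketch and the helper withNativeCollector are hypothetical. Only the property key and the collector class name come from the release note. In a real job the same key/value pair would go into the job's configuration.

```java
import java.util.Properties;

/** Sketch only: java.util.Properties stands in for the job configuration. */
final class NativeCollectorConfigSketch {
    // Property key and collector class name as given in the release note.
    static final String COLLECTOR_KEY = "mapreduce.job.map.output.collector.class";
    static final String NATIVE_COLLECTOR =
        "org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator";

    /** Returns a copy of the given settings with the native collector selected. */
    static Properties withNativeCollector(Properties base) {
        Properties job = new Properties();
        job.putAll(base);
        job.setProperty(COLLECTOR_KEY, NATIVE_COLLECTOR);
        return job;
    }

    public static void main(String[] args) {
        Properties job = withNativeCollector(new Properties());
        // Print the setting in key=value form.
        System.out.println(COLLECTOR_KEY + "=" + job.getProperty(COLLECTOR_KEY));
    }
}
```

Because the setting is per job, other jobs on the same cluster keep the default collector unless they opt in the same way.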

Tags: optimization task

Description

I have recently been working on a JNI-based native optimization for MapTask.

The basic idea is to add a NativeMapOutputCollector that handles the k/v pairs emitted by the mapper, so that sort, spill, and IFile serialization can all be done in native code. A preliminary test (on a Xeon E5410, jdk6u24) showed promising results:

1. Sort is about 3x-10x as fast as Java (only binary string comparison is supported).

2. IFile serialization is about 3x as fast as Java, at about 500MB/s; if hardware CRC32C is used, things can get much faster (1G/

3. The merge code is not complete yet, so the test uses a large enough io.sort.mb to avoid intermediate spills.

This leads to a total speedup of 2x~3x for the whole MapTask when IdentityMapper (a mapper that does nothing) is used.
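
To make "binary string compare" in item 1 concrete, here is a minimal Java sketch of the ordering it presumably refers to: keys are compared as raw bytes, unsigned and lexicographically, with a shorter key sorting before any longer key it is a prefix of. The class and method names are hypothetical, and this illustrates only the ordering, not the native sort itself.

```java
import java.util.Arrays;

/** Sketch only: unsigned lexicographic ordering over raw key bytes. */
final class BinaryKeyOrderSketch {
    /** Compares two keys byte by byte as unsigned values; ties go to the shorter key. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff; // mask to treat the byte as unsigned
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    /** Returns a sorted copy of the keys; the input array is left untouched. */
    static byte[][] sortKeys(byte[][] keys) {
        byte[][] sorted = keys.clone();
        Arrays.sort(sorted, BinaryKeyOrderSketch::compareBytes);
        return sorted;
    }
}
```

An ordering like this needs no deserialization of the key, which is one reason a native sort restricted to it can be fast.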

There are limitations, of course: currently only Text and BytesWritable are supported, and I have not thought through many things yet, such as how to support map-side combine. I discussed this with someone familiar with Hive, and it seems these limitations won't be much of a problem for Hive to benefit from these optimizations, at least. Advice or discussion about improving compatibility is most welcome.

Currently NativeMapOutputCollector has a static method called canEnable(), which checks whether the key/value types, comparator type, and combiner are all compatible; if they are, MapTask can choose to enable NativeMapOutputCollector.
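
The kind of check canEnable() performs might look roughly like the sketch below. This is not the code from the patch: the signature, the use of class-name strings, and the exact rules are assumptions drawn from the limitations stated in this description (only Text and BytesWritable, only binary comparison, no combiner support yet).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Sketch only: a hypothetical compatibility check in the spirit of canEnable(). */
final class CanEnableSketch {
    // Per the description, only Text and BytesWritable are supported so far.
    static final Set<String> SUPPORTED_TYPES = new HashSet<>(Arrays.asList(
        "org.apache.hadoop.io.Text",
        "org.apache.hadoop.io.BytesWritable"));

    /**
     * Assumed rules: key and value types must be supported, the job must use
     * plain binary key comparison, and no combiner may be configured.
     */
    static boolean canEnable(String keyClass, String valueClass,
                             boolean usesBinaryComparator, boolean hasCombiner) {
        return SUPPORTED_TYPES.contains(keyClass)
            && SUPPORTED_TYPES.contains(valueClass)
            && usesBinaryComparator
            && !hasCombiner;
    }
}
```

With a check of this shape, a job that fails any condition simply keeps the existing Java collector, so enabling the native path stays opt-in and safe.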

This is only a preliminary test; more work needs to be done. I expect better final results, and I believe similar optimizations can be applied to the reduce task and shuffle too.

Attachments

MAPREDUCE-2841.v1.patch | 18/Aug/11 11:59 | 180 kB | Binglin Chang
dualpivot-0.patch | 28/Aug/11 11:48 | 5 kB | Christopher Douglas
dualpivotv20-0.patch | 28/Aug/11 11:48 | 4 kB | Christopher Douglas
MAPREDUCE-2841.v2.patch | 30/Aug/11 07:11 | 190 kB | Binglin Chang
DESIGN.html | 02/Feb/12 15:55 | 42 kB | Binglin Chang
fb-shuffle.patch | 05/Apr/14 21:38 | 76 kB | Todd Lipcon
hadoop-3.0-mapreduce-2841-2014-7-17.patch | 17/Jul/14 16:15 | 3.50 MB | Sean Zhong
micro-benchmark.txt | 29/Aug/14 22:49 | 13 kB | Todd Lipcon
MR-2841benchmarks.pdf | 03/Sep/14 03:13 | 213 kB | Todd Lipcon
mr-2841-merge.txt | 05/Sep/14 17:53 | 2.72 MB | Todd Lipcon
mr-2841-merge-2.txt | 06/Sep/14 02:39 | 2.70 MB | Sean Zhong
mr-2841-merge-3.patch | 06/Sep/14 04:00 | 2.73 MB | Sean Zhong
mr-2841-merge-4.patch | 07/Sep/14 05:18 | 2.68 MB | Sean Zhong

Issue Links

duplicates

MAPREDUCE-1270 Hadoop C++ Extention

Resolved

MAPREDUCE-2446 HCE 2.0

Resolved

is depended upon by

HIVE-17498 Does hive have mr-nativetask support refer to MAPREDUCE-2841

Open

is related to

MAPREDUCE-6985 MapReduce native optimization does not work properly due to a shuffle error (LocalJobRunner)

Open

MAPREDUCE-5962 Support CRC32C in IFile

Resolved

HADOOP-10855 Allow Text to be read with a known length

Closed

relates to

MAPREDUCE-6106 hadoop-mapreduce-client-nativetask fails to compile on OS X

Resolved

MAPREDUCE-3247 Add hash aggregation style data flow and/or new API

Open

MAPREDUCE-3246 Make Task extensible to support modifications of Task or even alternate programming paradigms

Open

MAPREDUCE-1270 Hadoop C++ Extention

Resolved


Sub-Tasks

native-task test logs should not write to console

Open

Unassigned

People

Assignee: Sean Zhong

Reporter: Binglin Chang

Votes: 4

Watchers: 79

Dates

Created: 13/Aug/11 07:29

Updated: 19/Oct/17 05:54

Resolved: 13/Sep/14 01:47