First, Arun and Todd, thank you both for your honest opinions! You are both respected!
I believe our differences will narrow once we are looking at the same facts, so I'd like to state the facts and clear up some confusion:
1. How many lines of code are there, exactly?
Here is a breakdown for branch https://github.com/intel-hadoop/nativetask/tree/native_output_collector:
Java code (*.java): 122 files, 8057 lines
  - nativetask: 62 files, 4080 lines
  - nativetask unit tests: 14 files, 1222 lines
  - other platforms (Pig/Mahout/HBase/Hive): 25 files, 477 lines
  - scenario tests: 21 files, 2278 lines
Native code (*.h, *.cc): 128 files, 47048 lines
  - nativetask: 85 files, 11713 lines
  - nativetask unit tests: 33 files, 4911 lines
  - other platforms (Pig/Mahout/HBase/Hive): 2 files, 1083 lines
  - third-party gtest lib header files: 3 files, 28699 lines
  - third-party lz4/snappy/cityhash: 5 files, 642 lines
(Note: license header lines are excluded from these counts; blank lines and other comments are included.)
If we measure LOC as a proxy for code complexity, then:
  - Third-party code such as the Google Test header files should not be counted; the gtest header alone is 28699 lines.
  - Pig/Mahout/HBase/Hive code will eventually be removed from the repository, so it should not be counted.
  - Scenario test code may also be excluded, since new scenario tests can always be written.
So after these deductions, the effective code is:
  - NativeTask source code (Java + native C++): 15793 lines
  - NativeTask unit test code (Java + native C++): 6133 lines
2. Is this patch an alternate implementation of the MapReduce runtime, like TEZ?
No. The whole purpose of this submission is to act as a map output collector that transparently improves MapReduce performance, NOT to be a new MR engine.
The code is posted at branch https://github.com/intel-hadoop/nativetask/tree/native_output_collector; it only includes the code for the map output collector.
3. Why is there Pig/Mahout/HBase/Hive code in the NativeTask source?
As I commented above, we are working on removing the platform (Hive/Pig/HBase/Mahout) code from the NativeTask source and providing it as standalone jars.
We rushed to post the link without a full cleanup so that we could get some early feedback from the community.
4. Is the full native runtime included?
No, the full native runtime is not included in this patch; the related code has been stripped out. The repo https://github.com/intel-hadoop/nativetask/tree/native_output_collector only contains code for the transparent collector.
5. Is there an intention to contribute the full native runtime mode to Hadoop, or will it become a separate project?
It is not the purpose of this patch to support a full native runtime mode; the goal is to make existing MR jobs run better on modern CPUs via a native map output collector.
Full native runtime mode is another topic. There is a long way to go before it would be ready for submission, and we don't want to consider it now.
6. Are there interface compatibility issues?
This patch is not about the full native runtime mode, which supports native mappers and reducers.
This patch is only about a custom map output collector running in transparent mode. We use the existing Java interfaces, and people keep running their Java mappers/reducers without recompilation. A user makes a small configuration change to enable the NativeTask collector, and when there is a case NativeTask doesn't support, it simply falls back to the default MR implementation.
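For reference, the small configuration change is along these lines; the property and class names below reflect the current patch as I understand it and may change before final submission:

```xml
<!-- Enable the NativeTask map output collector for a job.
     Property and class names are illustrative of the current patch. -->
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator</value>
</property>
```

If the delegator hits a case it doesn't support, the job uses the default Java collector, so existing mappers and reducers run unchanged.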
8. Are there C++ ABI issues?
The concern makes sense.
Regarding the ABI: if a user doesn't need a custom key comparator, he never needs to implement a native comparator against nativetask.so, so there is no ABI issue.
If the user does want to write a native comparator, the NativeTask native interface involved is very limited, only:
typedef int (*ComparatorPtr)(const char * src, uint32_t srcLength, const char * dest, uint32_t destLength);
However, the current code requires the user to include the whole "NativeTask.h", which contains more than the typedef above.
We will work on making "NativeTask.h" expose only the minimum necessary API. Once we do that, there should be no significant ABI issue.
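To illustrate how small that surface is, here is a hypothetical user-supplied comparator matching the ComparatorPtr signature above; the function name and the byte-wise ordering are my own choices for the example, not part of NativeTask:

```cpp
#include <cstdint>
#include <cstring>
#include <algorithm>

// The only NativeTask interface a custom comparator needs to match.
typedef int (*ComparatorPtr)(const char * src, uint32_t srcLength,
                             const char * dest, uint32_t destLength);

// Example implementation: lexicographic comparison of raw key bytes,
// comparing the common prefix first, then breaking ties by length.
int BytesComparator(const char * src, uint32_t srcLength,
                    const char * dest, uint32_t destLength) {
  uint32_t minLen = std::min(srcLength, destLength);
  int cmp = std::memcmp(src, dest, minLen);
  if (cmp != 0) {
    return cmp;  // common prefix already decides the order
  }
  // Equal prefixes: the shorter key sorts first.
  return static_cast<int>(srcLength) - static_cast<int>(destLength);
}

// Usage sketch: hand the function to NativeTask through the typedef,
// e.g.  ComparatorPtr cmp = BytesComparator;
```

Because the contract is a single C-style function pointer over raw bytes and lengths, there is no C++ name mangling or class layout crossing the library boundary, which is why the ABI exposure stays small.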
8. How can you ensure the quality of this code?
The code has been under active development for more than a year. It has been used and tested in production for a long time, and there is also a full set of unit tests and scenario tests for coverage.
9. Could this be done in TEZ instead?
We believe it is good for MapReduce. We know people are still using MapReduce, and that justifies submitting the patch.
Whether TEZ can benefit from this feature should not affect the decision made here, as they are two separate projects.
If the value of this patch is appreciated, we are totally open to ALSO porting it to TEZ. We may want your help with that.
10. How can you ensure the code will be well maintained in the future, given that this bug was opened so long ago and not well maintained?
If you check the commit history at https://github.com/intel-hadoop/nativetask/commits/native_output_collector, you will find that the code has been actively and continuously developed over the years, in public. To date there are 319 commits. I personally have worked on this project since the end of 2012.
We post it here now because we believe that, after all these commits, the code quality has reached a level worth contributing back formally to Hadoop. The code is stable and has been well tested in production environments. We are very serious about this and will definitely take responsibility for maintaining it. Besides, we have Todd's help on this.
11. Why do you still work on MapReduce, when new investment is no longer going into MapReduce?
We believe that as long as there is still a big user base running MapReduce in their daily jobs, no matter how slowly MR features evolve or where new investment goes, there is value in contributing to it. Intel has a history of supporting decades-old projects.
We are also open to applying the "nativetask" ideas to other projects, but that is not directly related to this JIRA.
12. How is encrypted shuffle supported in NativeTask?
First, there is a flag to enable/disable the native output collector feature, under the end user's full control.
Second, if there is a case NativeTask doesn't support, it simply falls back to the default MapReduce implementation.
For encrypted shuffle, we fall back to the default MR implementation. However, we are totally open to supporting further cases like this as long as there are still performance benefits; otherwise we just fall back.
13. Why is the code on github at https://github.com/intel-hadoop/nativetask/tree/native_output_collector?
It is just a place for early code review and early feedback. We will submit it as a standard patch after we clean out the Hive/Pig platform code.
14. Why not make it an incubator project instead of putting it in the MR code base?
a) It is good for MapReduce; it helps MapReduce run better.
b) More MapReduce users can use it. It is generally easier for a user to upgrade Hadoop and use this feature if it is provided in the Hadoop package.
c) The performance benefit for MapReduce is big. We believe low-level CPU optimization is still important, even knowing that there are other optimizations of the computation flow such as TEZ, Spark, Impala, etc.
d) This map output collector is transparent, so it is not unreasonable to provide it as an alternative collector in the Hadoop package. There is precedent for this; for example, the fair scheduler is bundled in a similar way.
e) The risk is low. NativeTask falls back to the default MR implementation if a case is not supported or something goes wrong.
f) The change is small. Only about 10 lines of code change are required in Hadoop, so we expect no possibility of regression. The patch for Hadoop is provided at:
https://github.com/intel-hadoop/nativetask/blob/native_output_collector/patch/hadoop-2.patch The rest of the NativeTask code is self-contained and won't pollute MapReduce.
At runtime, we only require deploying the NativeTask jars into the Hadoop directory at installation time.
These considerations lead us to feel that it is better to do this in Hadoop rather than as a new project.
I interpreted the questions above from the comments in a rather subjective manner, so I suspect some points are missing or wrongly interpreted and need further elaboration or correction. Which are they?
I am sure there are still concerns about this patch, but before we say no and reject it, maybe we should pause and think:
Is there still a big user base using MapReduce? Is this patch valuable for those users? Is the risk introduced by this patch controlled? Can these concerns be addressed by improving the patch?
In the Hadoop ecosystem there is a lot of high-level innovation, such as TEZ, Impala, Spark, etc., but far less innovation or optimization at the micro-architecture level. This patch is an attempt in that direction.