[TAJO-374] Investigate more efficient intermediate shuffle methods - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: Data Shuffle
Labels:
None

Description

Motivation

Currently, Tajo materializes intermediate data on local disks, storing one file for each partition. This becomes inefficient and does not scale as the data volume increases. In MR, this challenge was resolved by sorting intermediate key-values, grouping data with the same key, and indexing on keys. However, that approach requires unnecessary sorting and disk I/O, so it is not feasible in Tajo.
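
The trade-off described above can be illustrated with a minimal Python sketch. It is not Tajo's or MapReduce's implementation: the function names (`partition_of`, `file_count_per_partition`, `write_indexed`, `read_partition`), the tab-separated row layout, and the assumption that every producing task writes its own set of partition files are all hypothetical and chosen only for illustration. The sketch contrasts the file count of a one-file-per-partition layout with a single output that groups rows by partition, without sorting keys, and records an (offset, length) index per partition.

```python
import io
import zlib
from collections import defaultdict


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioning (crc32 is stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def file_count_per_partition(num_tasks: int, num_partitions: int) -> int:
    """File count if each task writes one file per partition (assumption)."""
    return num_tasks * num_partitions


def write_indexed(rows, num_partitions):
    """Group rows by partition (no key sort), then lay the groups out
    back to back in one buffer with an (offset, length) index."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[partition_of(key, num_partitions)].append(f"{key}\t{value}\n")

    data = io.BytesIO()
    index = {}
    for pid in range(num_partitions):
        chunk = "".join(buckets.get(pid, [])).encode("utf-8")
        index[pid] = (data.tell(), len(chunk))
        data.write(chunk)
    return data.getvalue(), index


def read_partition(blob, index, pid):
    """Fetch one partition's rows using only its index entry."""
    offset, length = index[pid]
    text = blob[offset:offset + length].decode("utf-8")
    return [tuple(line.split("\t")) for line in text.splitlines()]
```

Under these assumptions, 100 tasks and 32 partitions would produce 3,200 files in the per-partition layout, while the indexed layout produces one output per task and lets a consumer read only the byte range of the partition it needs.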

References

~~TAJO-292~~ is an ad hoc fix that reduces the number of intermediate files, but it still does not scale.
Optimizing MapReduce Job Performance (http://www.slideshare.net/cloudera/mr-perf)
Multilevel aggregation for Hadoop/MapReduce (http://www.slideshare.net/ozax86/prestrata-hadoop-word-meetup)
Sailfish: A Framework for Large Scale Data Processing (http://research.yahoo.com/files/yl-2012-002.pdf)
MAPREDUCE-4502 - Node-level aggregation with combining the result of maps
~~MAPREDUCE-2841~~ - Task level native optimization

Issue Links

relates to

TAJO-292 Too many intermediate partition files (Resolved)

MAPREDUCE-4502 Node-level aggregation with combining the result of maps (Patch Available)

Sub-Tasks

1.	Running PullServer on a dedicated JVM process which separates from worker.		Resolved	Hyoungjun Kim
2.	Reduce number of hash shuffle output file.		Resolved	Hyoungjun Kim

People

Assignee: Unassigned

Reporter: Hyunsik Choi

Votes: 0

Watchers: 2

Dates

Created: 03/Dec/13 15:31

Updated: 14/Sep/14 13:33

Resolved: 14/Sep/14 13:33