Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Motivation
Currently, Tajo materializes intermediate data on local disks, storing one file for each partition. This becomes inefficient and does not scale as data volume increases. MapReduce addressed this challenge by sorting intermediate key-values, grouping data with the same key, and indexing on keys. However, that approach requires unnecessary sorting and disk I/O, so it is not feasible in Tajo.
References
TAJO-292 is an ad-hoc resolution to reduce the number of intermediate files, but it is still not scalable.
- Optimizing MapReduce Job Performance (http://www.slideshare.net/cloudera/mr-perf)
- Multilevel aggregation for Hadoop/MapReduce (http://www.slideshare.net/ozax86/prestrata-hadoop-word-meetup)
- SAILFISH: A FRAMEWORK FOR LARGE SCALE DATA PROCESSING (http://research.yahoo.com/files/yl-2012-002.pdf)
- MAPREDUCE-4502 - Node-level aggregation with combining the result of maps
- MAPREDUCE-2841 - Task level native optimization
Issue Links
- relates to
  - TAJO-292 Too many intermediate partition files (Resolved)
  - MAPREDUCE-4502 Node-level aggregation with combining the result of maps (Patch Available)