TAJO-374

Investigate more efficient intermediate shuffle methods

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: data shuffle
    • Labels:
      None

      Description

      Motivation

      Currently, Tajo materializes intermediate data on local disks, storing one file per partition. This becomes inefficient and does not scale as data volume increases. In MapReduce (MR), this challenge was resolved by sorting intermediate key-value pairs, grouping records with the same key, and indexing on keys. However, that approach requires unnecessary sorting and disk I/O, so it is not feasible in Tajo.
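
      To make the trade-off concrete, here is a minimal sketch of one possible middle ground: records are grouped by partition (without a full key sort) and written to a single stream, with an offset index that lets a reader fetch one partition at a time. This is an illustration only, not Tajo's or MapReduce's actual shuffle code. The function names, the tab-separated record format, and the CRC32 partitioner are all assumptions made for the example.

```python
import io
import zlib


def partition_of(key, num_partitions):
    """Hypothetical partitioner: a stable hash of the key (CRC32) modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def write_indexed_shuffle(records, num_partitions, out):
    """Write every partition into one stream and return {partition: (offset, length)}.

    Records are grouped by partition id only; keys are not sorted.
    Keys and values are assumed to be strings without tabs or newlines.
    """
    buckets = {}
    for key, value in records:
        buckets.setdefault(partition_of(key, num_partitions), []).append((key, value))

    index = {}
    for pid in sorted(buckets):
        start = out.tell()
        for key, value in buckets[pid]:
            out.write(f"{key}\t{value}\n".encode("utf-8"))
        index[pid] = (start, out.tell() - start)
    return index


def read_partition(stream, index, pid):
    """Read back only the byte range that belongs to one partition."""
    offset, length = index.get(pid, (0, 0))
    stream.seek(offset)
    data = stream.read(length).decode("utf-8")
    return [tuple(line.split("\t", 1)) for line in data.splitlines()]


if __name__ == "__main__":
    buf = io.BytesIO()  # stands in for a single intermediate file on local disk
    idx = write_indexed_shuffle([("a", "1"), ("b", "2"), ("a", "3")], 4, buf)
    for pid in range(4):
        print(pid, read_partition(buf, idx, pid))
```

      With this layout the number of intermediate files per task stays at one regardless of the partition count, at the cost of buffering records per partition before writing. A real implementation would also need spilling and a persistent index format.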

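      References

        • TAJO-292 is an ad hoc fix that reduces the number of intermediate files, but it is still not scalable.
        • Optimizing MapReduce Job Performance (http://www.slideshare.net/cloudera/mr-perf)
        • Multilevel aggregation for Hadoop/MapReduce (http://www.slideshare.net/ozax86/prestrata-hadoop-word-meetup)
        • Sailfish: A Framework for Large Scale Data Processing (http://research.yahoo.com/files/yl-2012-002.pdf)
        • MAPREDUCE-4502 - Node-level aggregation with combining the result of maps
        • MAPREDUCE-2841 - Task level native optimization

        Issue Links

          • relates to TAJO-292
          • relates to MAPREDUCE-4502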

          Activity

          Hyunsik Choi created issue -
          Hyunsik Choi made changes -
          Field: Link
          Original Value: (none)
          New Value: This issue relates to TAJO-292
          Hyunsik Choi made changes -
          Field: Link
          Original Value: (none)
          New Value: This issue relates to MAPREDUCE-4502
          Hyunsik Choi made changes -
          Field: Summary
          Original Value: Investigate more efficient intermediate data handling
          New Value: Investigate more efficient intermediate shuffle methods
          Hyunsik Choi made changes -
          Field: Description
          Original Value:

          h3. Motivation

          Currently, Tajo materializes intermediate data on local disks, storing one file per partition. This becomes inefficient and does not scale as data volume increases. In MapReduce (MR), this challenge was resolved by sorting intermediate key-value pairs, grouping records with the same key, and indexing on keys. However, that approach requires unnecessary sorting and disk I/O, so it is not feasible in Tajo.

          h3. References
           * TAJO-292 is an ad hoc fix that reduces the number of intermediate files, but it is still not scalable.
           * Optimizing MapReduce Job Performance (http://www.slideshare.net/cloudera/mr-perf)
           * Multilevel aggregation for Hadoop/MapReduce (http://www.slideshare.net/ozax86/prestrata-hadoop-word-meetup)
           * Sailfish: A Framework for Large Scale Data Processing (http://research.yahoo.com/files/yl-2012-002.pdf)
           * MAPREDUCE-4502 - Node-level aggregation with combining the result of maps

          New Value: same as the original value, with one reference appended:
           * MAPREDUCE-2841 - Task level native optimization
          Hyunsik Choi made changes -
          Field: Status
          Original Value: Open [ 1 ]
          New Value: Resolved [ 5 ]
          Field: Resolution
          Original Value: (none)
          New Value: Fixed [ 1 ]

            People

            • Assignee: Unassigned
            • Reporter: Hyunsik Choi
            • Votes: 0
            • Watchers: 2
