Tajo / TAJO-987

Hash shuffle should be balanced according to intermediate volumes

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: Data Shuffle
    • Labels:
      None

      Description

      Skewed data sets are common in practice. Currently, hash-shuffled intermediate data is distributed by partition key without considering the volume of each partition. As a result, with skewed intermediate data, a few nodes are likely to take much longer than the rest, which degrades performance. We need a way to mitigate this problem.

      This patch assigns intermediate data by balancing their volumes. The approach is a greedy algorithm. In many cases the number of shuffle partitions can exceed tens of thousands, so I also considered computational complexity: the algorithm runs in O(n). It should show reasonable performance and produce balanced results.
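
      The idea can be illustrated with a minimal sketch. This is not the code from the patch; it is one possible single-pass greedy scheme, assuming that the volume of each partition is known up front and that n is the number of partitions. The function name balance_by_volume and its parameters are hypothetical.

```python
from typing import Dict, List


def balance_by_volume(volumes: Dict[int, int], num_tasks: int) -> List[List[int]]:
    """Assign partition ids to tasks so that per-task volume is roughly even.

    Hypothetical sketch: walk the partitions once and keep filling the
    current task until it reaches the average target volume, then move on.
    One pass over the partitions, so the cost is O(n).
    """
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")

    total = sum(volumes.values())
    target = total / num_tasks  # average volume each task should receive

    assignments: List[List[int]] = [[] for _ in range(num_tasks)]
    task = 0
    load = 0
    for partition_id, volume in volumes.items():
        assignments[task].append(partition_id)
        load += volume
        # Advance once the current task is full; the last task takes the rest.
        if load >= target and task < num_tasks - 1:
            task += 1
            load = 0
    return assignments


# Partition 0 is heavily skewed compared to the others.
example = {0: 100, 1: 10, 2: 10, 3: 10, 4: 10, 5: 10, 6: 10, 7: 40}
print(balance_by_volume(example, 2))  # [[0], [1, 2, 3, 4, 5, 6, 7]]
```

      In this example both tasks receive a volume of 100. Distributing the same partitions by key alone (partition id modulo 2) would give volumes of 130 and 70. A single pass like this does not guarantee an optimal split, but it keeps the cost linear, which matters when there are tens of thousands of partitions.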

            People

            • Assignee:
              hyunsik Hyunsik Choi
              Reporter:
              hyunsik Hyunsik Choi
