Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-24898 Some further improvements of sort-shuffle
  3. FLINK-25796

Avoid record copy for result partition of sort-shuffle if there are enough buffers for better performance

    XMLWordPrintableJSON

Details

    Description

      Currently, for result partition of sort-shuffle, there is extra record copy overhead Introduced by clustering records by subpartition index. For small records, this overhead can cause even 20% performance regression. This ticket aims to solve the problem.

      In fact, the hash-based implementation is a nature way to achieve the goal of sorting records by partition index. However, it incurs some serious weaknesses. For example, when there is no enough buffers or there is data skew, it can waste buffers and influence compression efficiency which can cause performance regression.

      This ticket tries to solve the issue by dynamically switching between the two implementations, that is, if there are enough buffers, the hash-based implementation will be used and if there is no enough buffers, the sort-based implementation will be used.

      Attachments

        Issue Links

          Activity

            People

              kevin.cyj Yingjie Cao
              kevin.cyj Yingjie Cao
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: