  1. Tajo
  2. TAJO-584

Improve distributed merge sort

    Details

      Description

      In Tajo, the sort operator resembles a merge sort and works in a distributed manner. The first sort phase sorts each fragment on its local machine, the intermediate data are shuffled by range partitioning, and the second sort phase on each node then sorts the range-partitioned data.
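
      To make the first phase and the range shuffle concrete, here is a minimal sketch in plain Java. It is not Tajo's implementation: the class RangeShuffleSketch, its methods, and the split points 10 and 20 are all hypothetical, and keys are plain long values rather than tuples. It only illustrates that a fragment sorted locally stays sorted within each range partition after it is split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; none of these names come from Tajo.
class RangeShuffleSketch {

    // First sort phase: sort one fragment locally.
    static List<Long> sortFragment(List<Long> fragment) {
        List<Long> sorted = new ArrayList<>(fragment);
        Collections.sort(sorted);
        return sorted;
    }

    // Index of the range partition that owns a key.
    // boundaries are ascending split points; partition p holds keys
    // k with boundaries[p-1] <= k < boundaries[p].
    static int partitionOf(long key, long[] boundaries) {
        int p = 0;
        while (p < boundaries.length && key >= boundaries[p]) {
            p++;
        }
        return p;
    }

    // Shuffle step: split one sorted run into boundaries.length + 1 slices.
    // Keys are appended in run order, so every slice is itself sorted.
    static List<List<Long>> shuffle(List<Long> sortedRun, long[] boundaries) {
        List<List<Long>> parts = new ArrayList<>();
        for (int i = 0; i <= boundaries.length; i++) {
            parts.add(new ArrayList<>());
        }
        for (long key : sortedRun) {
            parts.get(partitionOf(key, boundaries)).add(key);
        }
        return parts;
    }

    public static void main(String[] args) {
        long[] boundaries = {10L, 20L}; // assumed split points
        List<Long> run = sortFragment(Arrays.asList(25L, 3L, 14L, 8L, 19L));
        System.out.println(shuffle(run, boundaries)); // prints [[3, 8], [14, 19], [25]]
    }
}
```

      Each node in the second phase receives one such slice from every first-phase task, so its input is a collection of runs that are already sorted.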

      However, the second sort phase reads all shuffled data through a single scanner, so it misses the opportunity to exploit the fact that each shuffled data set is already sorted. This patch improves the second sort phase to directly merge the multiple already-sorted intermediate data sets, which significantly reduces the response time of sort queries.
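
      The idea of merging already-sorted inputs directly can be sketched as a heap-based k-way merge. This is an illustrative sketch under stated assumptions, not the code in the patch: the class KWayMergeSketch and its Cursor helper are hypothetical, the runs are in-memory lists of long keys, and the real operator reads tuples through scanners.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; none of these names come from Tajo.
class KWayMergeSketch {

    // Current smallest unread key of one sorted run, plus the rest of the run.
    private static final class Cursor implements Comparable<Cursor> {
        long head;
        final Iterator<Long> rest;

        Cursor(Iterator<Long> it) {
            this.rest = it;
            this.head = it.next(); // caller guarantees the run is non-empty
        }

        @Override
        public int compareTo(Cursor other) {
            return Long.compare(head, other.head);
        }
    }

    // Merges k sorted runs with a min-heap holding one cursor per run.
    // Cost is O(n log k) comparisons for n keys, instead of re-sorting all n.
    static List<Long> merge(List<List<Long>> sortedRuns) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<Long> run : sortedRuns) {
            Iterator<Long> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it));
            }
        }
        List<Long> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            out.add(smallest.head);
            if (smallest.rest.hasNext()) {
                // Advance the cursor and put it back so the heap reorders it.
                smallest.head = smallest.rest.next();
                heap.add(smallest);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Long>> runs = new ArrayList<>();
        runs.add(Arrays.asList(1L, 4L, 9L));
        runs.add(Arrays.asList(2L, 3L));
        runs.add(Arrays.asList(5L));
        System.out.println(merge(runs)); // prints [1, 2, 3, 4, 5, 9]
    }
}
```

      With one cursor per shuffled data set, each output key costs a heap operation over the number of runs rather than a comparison sort over all keys, which is the opportunity the single-scanner approach gives up.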

      I ran a simple benchmark with the following query on the TPC-H 100GB data set:

      select l_orderkey from lineitem order by l_orderkey;
      

      The lineitem table occupies 75GB. The query response time dropped from 480 to 260 seconds, a reduction of roughly 46%. This patch builds on the design of TAJO-36, so it requires TAJO-36.

        Attachments

        1. TAJO-584.patch
          42 kB
          Hyunsik Choi
        2. TAJO-584_20140208_01:51:59.patch
          124 kB
          Hyunsik Choi

              People

              • Assignee:
                hyunsik Hyunsik Choi
                Reporter:
                hyunsik Hyunsik Choi