Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-23957

Sorts in subqueries are redundant and can be removed

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.4.0
    • 2.4.0
    • SQL
    • None

    Description

      Unless combined with a LIMIT, there's no correctness reason that planned and optimized subqueries should have any sort operators (since the result of the subquery is an unordered collection of tuples). 

      For example:

      SELECT count(1) FROM (select id FROM dft ORDER by id)

      has the following plan:

      == Physical Plan ==
      *(3) HashAggregate(keys=[], functions=[count(1)])
      +- Exchange SinglePartition
         +- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
            +- *(2) Project
               +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
                  +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
                     +- *(1) Project [id#0L]
                        +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
      

      ... but the sort operator is redundant.

      Less intuitively, the sort is also redundant in selections from an ordered subquery:

      SELECT * FROM (SELECT id FROM dft ORDER BY id)

      has plan:

      == Physical Plan ==
      *(2) Sort [id#0L ASC NULLS FIRST], true, 0
      +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
         +- *(1) Project [id#0L]
            +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
      

      ... but again, since the subquery returns a bag of tuples, the sort is unnecessary.

      We should consider adding an optimizer rule that removes a sort inside a subquery. SPARK-23375 is related, but removes sorts that are functionally redundant because they perform the same ordering.
       

      Attachments

        Activity

          People

            henryr Henry Robinson
            henryr Henry Robinson
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: