Spark / SPARK-19503

Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: Optimizer, SQL
    • Environment: Perhaps only a pyspark or databricks AWS issue
    • Flags: Important

    Description

      df.sort(...).count()
      performs a shuffle and a sort and then the count. This is wasteful, as the sort is not required to compute the count, and it makes me wonder how smart the algebraic optimiser really is. The data may already be partitioned with a known row count (such as Parquet files), and we should not shuffle just to perform a count.

      This may look trivial, but if the optimiser fails to recognise this case, I wonder what else it is missing, especially in more complex operations.
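      The requested rewrite can be sketched as a toy optimizer rule over a logical plan tree: when a Sort node sits directly beneath an operator whose result does not depend on input order (a global count aggregate), the Sort can be dropped. This is an illustrative model only; the node and function names below (Node, eliminate_sorts, "CountAggregate") are invented for the sketch and are not Spark's Catalyst API, though Catalyst does contain a rule of this general kind (EliminateSorts).

      ```python
      # Toy sketch of a sort-elimination rule, not Spark's actual implementation.
      # A plan is a linked chain of nodes; ops() lists the operators top-down.
      class Node:
          def __init__(self, op, child=None):
              self.op = op          # "Scan", "Sort", or "CountAggregate"
              self.child = child

      def eliminate_sorts(node):
          """Return a copy of the plan with order-irrelevant Sorts removed."""
          if node is None:
              return None
          child = eliminate_sorts(node.child)
          # A global count is insensitive to the order of its input, so a
          # Sort directly below it cannot change the result and is dropped.
          if node.op == "CountAggregate" and child is not None and child.op == "Sort":
              child = child.child
          return Node(node.op, child)

      def ops(node):
          return [] if node is None else [node.op] + ops(node.child)

      # df.sort(...).count() corresponds to CountAggregate -> Sort -> Scan;
      # after the rewrite the Sort (and its shuffle) is gone.
      plan = Node("CountAggregate", Node("Sort", Node("Scan")))
      print(ops(eliminate_sorts(plan)))  # ['CountAggregate', 'Scan']
      ```

      In real Spark one can check whether such a rewrite happened by comparing `df.sort(...).count()` and `df.count()` with `explain()`: if the rule fires, neither physical plan should contain a sort-induced exchange.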

      Attachments

      Activity

      People

        Assignee: Unassigned
        Reporter: rkarimi
        Votes: 0
        Watchers: 5

      Dates

        Created:
        Updated:
        Resolved: