Spark / SPARK-31137

Opportunity to simplify execution plan when passing empty dataframes to subtract()


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Do
    • Affects Version/s: 2.4.5
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      The execution plan is the same whether an empty or a non-empty DataFrame is passed to PySpark's subtract call.

      df.subtract(regDf)

      yields the same physical plan as:

      df.subtract(emptyDf)

      Since the operation (EXCEPT DISTINCT in Spark SQL) requires a sort on both DataFrames, short-circuiting the empty case could yield a significant speed-up: if the incoming DataFrame is empty, no processing should be needed.
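      Until the optimizer handles this case, a caller-side workaround is possible. The following is a minimal sketch, not part of Spark: `subtract_skip_empty` is a hypothetical helper name, and it assumes that probing the second DataFrame with `take(1)` is cheap compared with the full subtract. It relies on the fact that EXCEPT DISTINCT against an empty input reduces to the distinct rows of the first DataFrame.

```python
def subtract_skip_empty(df, other):
    """Hypothetical helper: subtract `other` from `df`, skipping the
    EXCEPT DISTINCT plan when `other` has no rows.

    Assumes both arguments expose the PySpark DataFrame methods
    take(), distinct() and subtract().
    """
    # take(1) returns a list with at most one Row; an empty list
    # means `other` is empty. Note that this probe runs a small job.
    if len(other.take(1)) == 0:
        # df EXCEPT DISTINCT <empty> is just the distinct rows of df.
        return df.distinct()
    # Otherwise fall back to the regular subtract.
    return df.subtract(other)
```

      With the DataFrames from this report, `subtract_skip_empty(df, emptyDf).explain()` should show only the deduplication of `df`, while `subtract_skip_empty(df, regDf)` keeps the usual plan. The trade-off is the extra job triggered by `take(1)`, which may not pay off if `other` is itself expensive to compute.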

       

      This should be a quick fix for a seasoned committer.


          People

            Assignee: Unassigned
            Reporter: S Daniel Zafar (dan_z)
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Original Estimate: 24h
                Remaining Estimate: 24h
                Time Spent: Not Specified