Spark / SPARK-19503

Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: Optimizer, SQL
    • Environment: Perhaps only a pyspark or databricks AWS issue
    • Flags: Important

    Description

      df.sort(...).count()
      performs a shuffle and a sort and then the count. This is wasteful, as the sort is not required to compute the count, and it makes me wonder how smart the algebraic optimiser really is. The data may already be partitioned with a known row count (such as Parquet files), and we should not shuffle just to perform a count.

      This may look trivial, but if the optimiser fails to recognise this case, I wonder what else it is missing, especially in more complex operations.
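      The requested rewrite can be sketched as a toy optimizer rule over a logical plan tree: when a Sort node sits directly beneath an operator whose result does not depend on input order (a global count aggregate), the Sort can be dropped. This is an illustrative model only; the node and function names below (Node, eliminate_sorts, "CountAggregate") are invented for the sketch and are not Spark's Catalyst API, though Catalyst does contain a rule of this general kind (EliminateSorts).

      ```python
      # Toy sketch of a sort-elimination rule, not Spark's actual implementation.
      # A plan is a linked chain of nodes; ops() lists the operators top-down.
      class Node:
          def __init__(self, op, child=None):
              self.op = op          # "Scan", "Sort", or "CountAggregate"
              self.child = child

      def eliminate_sorts(node):
          """Return a copy of the plan with order-irrelevant Sorts removed."""
          if node is None:
              return None
          child = eliminate_sorts(node.child)
          # A global count is insensitive to the order of its input, so a
          # Sort directly below it cannot change the result and is dropped.
          if node.op == "CountAggregate" and child is not None and child.op == "Sort":
              child = child.child
          return Node(node.op, child)

      def ops(node):
          return [] if node is None else [node.op] + ops(node.child)

      # df.sort(...).count() corresponds to CountAggregate -> Sort -> Scan;
      # after the rewrite the Sort (and its shuffle) is gone.
      plan = Node("CountAggregate", Node("Sort", Node("Scan")))
      print(ops(eliminate_sorts(plan)))  # ['CountAggregate', 'Scan']
      ```

      In real Spark one can check whether such a rewrite happened by comparing `df.sort(...).count()` and `df.count()` with `explain()`: if the rule fires, neither physical plan should contain a sort-induced exchange.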

      Attachments

      Activity

      People

        Assignee: Unassigned
        Reporter: rkarimi
        Votes: 0
        Watchers: 5

      Dates

        Created:
        Updated:
        Resolved: