Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.0.2
Fix Version/s: None
Description
Spark currently uses sort-based aggregates only under limited conditions, namely the cases where Spark cannot use partial aggregation and hash-based aggregates.
However, if the input ordering already satisfies the requirements of sort-based aggregates, sort-based aggregates seem to be faster than hash-based ones.
./bin/spark-shell --conf spark.sql.shuffle.partitions=1

val df = spark.range(10000000).selectExpr("id AS key", "id % 10 AS value").sort($"key").cache

def timer[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  val t1 = System.nanoTime()
  println("Elapsed time: " + ((t1 - t0 + 0.0) / 1000000000.0) + "s")
  result
}

timer { df.groupBy("key").count().count }

// codegen'd hash aggregate
Elapsed time: 7.116962977s

// non-codegen'd sort aggregate
Elapsed time: 3.088816662s
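For reference, one way to check which aggregate implementation the planner picked is to inspect the physical plan with explain(); the plan output shown in the comments below is abbreviated and illustrative of Spark 2.x operator names:

// Print the physical plan. With the default planner this query uses
// hash-based aggregation, shown as HashAggregate operators; a plan using
// sort-based aggregation would show SortAggregate instead.
df.groupBy("key").count().explain()
// == Physical Plan ==
// *HashAggregate(keys=[key#...], functions=[count(1)])
// +- *HashAggregate(keys=[key#...], functions=[partial_count(1)])
//    +- ...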
If codegen'd sort-based aggregates are supported (SPARK-16844), the performance gap seems to grow even bigger:

// codegen'd sort aggregate
Elapsed time: 1.645234684s
Therefore, it would be better to use sort-based aggregates in this case.
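As a rough illustration of the idea (not an actual Spark rule), a physical-plan rule could swap a HashAggregateExec for a SortAggregateExec when the child's output ordering already covers the grouping keys. The rule name and the simplified ordering check below are hypothetical; the constructor parameters are those of the Spark 2.x aggregate operators:

import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.aggregate.{HashAggregateExec, SortAggregateExec}

// Hypothetical rule: prefer sort-based aggregation when the child is
// already sorted on all grouping expressions, so no extra sort is needed.
object ReplaceHashWithSortAggregate extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = plan transform {
    case agg: HashAggregateExec
        if agg.groupingExpressions.nonEmpty &&
          // Simplified check: every grouping key appears in the child's
          // output ordering (a real rule would also check ordering
          // prefixes and sort direction).
          agg.groupingExpressions.forall(g =>
            agg.child.outputOrdering.exists(_.child.semanticEquals(g))) =>
      SortAggregateExec(
        agg.requiredChildDistributionExpressions,
        agg.groupingExpressions,
        agg.aggregateExpressions,
        agg.aggregateAttributes,
        agg.initialInputBufferOffset,
        agg.resultExpressions,
        agg.child)
  }
}

Because the child is already sorted on the grouping keys, the sort aggregate's required child ordering is satisfied for free, which is exactly the situation benchmarked above.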