[SPARK-25209] Optimization in Dataset.apply for DataFrames


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      Dataset.apply calls dataset.deserializer (to provide an early error), which ends up running the full Analyzer on the deserializer expression. This can take tens of milliseconds, depending on how big the plan is.
      Since Dataset.apply is called for many Dataset operations, such as Dataset.where, it can be a significant overhead for short queries.

      In the following code, the duration is 17 ms in current Spark vs. 1 ms if I remove the dataset.deserializer line.
      The resulting deserializer is particularly big in the case of a nested schema, but the same overhead can be observed with a very wide flat schema.
      According to a comment in the PR that introduced this check, it can at least be removed for DataFrames: https://github.com/apache/spark/pull/20402#discussion_r164338267

          import org.apache.spark.sql.functions.lit

          // Build a nested schema: a struct with 100 fields, so the
          // generated deserializer is large.
          val col = "named_struct(" +
            (0 until 100).map { i => s"'col$i', id" }.mkString(",") + ")"
          val df = spark.range(10).selectExpr(col)
          val TRUE = lit(true)
          val numIter = 1000
          val startTime = System.nanoTime()
          for (_ <- 0 until numIter) {
            // Each where() goes through Dataset.apply and hence the
            // eager deserializer check.
            df.where(TRUE)
          }
          val durationMs = (System.nanoTime() - startTime) / numIter / 1000000
          println(s"duration $durationMs ms")
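
      Based on the linked PR comment, a minimal sketch of what skipping the check for DataFrames could look like (an assumption for illustration: the shape of Dataset.apply and the exprEnc/clsTag/deserializer names follow Spark's encoder internals as described above, not a verbatim copy of any patch):

          import org.apache.spark.sql.{Dataset, Encoder, Row, SparkSession}
          import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

          // Sketch only: keep the eager deserializer resolution for typed
          // Datasets, where user-defined encoders can fail analysis, but
          // skip it for DataFrames. A DataFrame is a Dataset[Row] whose
          // RowEncoder is derived directly from the schema, so the early
          // check cannot surface a new error there.
          def apply[T: Encoder](
              sparkSession: SparkSession,
              logicalPlan: LogicalPlan): Dataset[T] = {
            val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
            if (dataset.exprEnc.clsTag.runtimeClass != classOf[Row]) {
              // Running this resolves the deserializer through the full
              // Analyzer, which is the cost measured in the benchmark above.
              dataset.deserializer
            }
            dataset
          }

      With this guard, the benchmark's df.where(TRUE) loop never reaches the Analyzer call, which is consistent with the 17 ms vs. 1 ms difference reported above.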
       

          People

            Assignee: Bogdan Raducanu
            Reporter: Bogdan Raducanu
            Votes: 0
            Watchers: 2
