Details
- Type: Bug
- Status: Open
- Priority: Critical
- Resolution: Unresolved
- Affects Version/s: 3.3.2, 3.5.0
- Fix Version/s: None
Description
With AQE enabled, having a sort in the plan changes the sample results after caching.
Moreover, on the cached DataFrame, collect() still returns the records of the non-cached plan, which is inconsistent with count() and show().
A script to reproduce:
import spark.implicits._

val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)

println("NON CACHED:")
println(" count: " + df.count())
println(" collect: " + df.collect().mkString(" "))
df.show()

println("CACHED:")
df.cache().count()
println(" count: " + df.count())
println(" collect: " + df.collect().mkString(" "))
df.show()

df.unpersist()
output:
NON CACHED:
 count: 2
 collect: [1] [4]
+---+
| id|
+---+
|  1|
|  4|
+---+

CACHED:
 count: 3
 collect: [1] [4]
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
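The script assumes an existing SparkSession named spark (e.g. spark-shell). For a standalone local run, a minimal session setup might look like the sketch below; the master and app name are placeholders, not taken from the report:

import org.apache.spark.sql.SparkSession

// Minimal local session; master and app name are assumed placeholders.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sample-after-cache-repro")
  .getOrCreate()

import spark.implicits._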
By the way, disabling AQE with spark.conf.set("spark.databricks.optimizer.adaptive.enabled", "false") works around the issue on Databricks clusters, but locally it has no effect, at least on Spark 3.3.2.
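For local runs, a minimal sketch of toggling AQE off, assuming the standard open-source key spark.sql.adaptive.enabled (the report itself only cites the Databricks-specific key):

// Disable AQE before re-running the repro; spark.sql.adaptive.enabled
// is the standard open-source flag, the spark.databricks.* key above
// is Databricks-specific.
spark.conf.set("spark.sql.adaptive.enabled", "false")

val df2 = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
df2.cache().count()
println(" count: " + df2.count())
println(" collect: " + df2.collect().mkString(" "))
df2.show()
df2.unpersist()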