[SPARK-25209] Optimization in Dataset.apply for DataFrames


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      Dataset.apply calls dataset.deserializer (to provide an early error), which ends up running the full Analyzer on the deserializer expression. This can take tens of milliseconds, depending on how big the plan is.
      Since Dataset.apply is called for many Dataset operations, such as Dataset.where, it can be a significant overhead for short queries.

      In the following code, the duration is 17 ms in current Spark vs. 1 ms if I remove the dataset.deserializer line.
      The resulting deserializer is particularly big in the case of a nested schema, but the same overhead can be observed with a very wide flat schema.
      According to a comment in the PR that introduced this check, it can at least be removed for DataFrames: https://github.com/apache/spark/pull/20402#discussion_r164338267

          import org.apache.spark.sql.functions.lit

          // Build a nested schema: a struct with 100 fields, so the
          // generated deserializer is large.
          val col = "named_struct(" +
            (0 until 100).map { i => s"'col$i', id" }.mkString(",") + ")"
          val df = spark.range(10).selectExpr(col)
          val TRUE = lit(true)
          val numIter = 1000
          val startTime = System.nanoTime()
          for (_ <- 0 until numIter) {
            // Each where() goes through Dataset.apply and hence the
            // eager deserializer check.
            df.where(TRUE)
          }
          val durationMs = (System.nanoTime() - startTime) / numIter / 1000000
          println(s"duration $durationMs ms")
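
      Based on the linked PR comment, a minimal sketch of what skipping the check for DataFrames could look like (an assumption for illustration: the shape of Dataset.apply and the exprEnc/clsTag/deserializer names follow Spark's encoder internals as described above, not a verbatim copy of any patch):

          import org.apache.spark.sql.{Dataset, Encoder, Row, SparkSession}
          import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

          // Sketch only: keep the eager deserializer resolution for typed
          // Datasets, where user-defined encoders can fail analysis, but
          // skip it for DataFrames. A DataFrame is a Dataset[Row] whose
          // RowEncoder is derived directly from the schema, so the early
          // check cannot surface a new error there.
          def apply[T: Encoder](
              sparkSession: SparkSession,
              logicalPlan: LogicalPlan): Dataset[T] = {
            val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
            if (dataset.exprEnc.clsTag.runtimeClass != classOf[Row]) {
              // Running this resolves the deserializer through the full
              // Analyzer, which is the cost measured in the benchmark above.
              dataset.deserializer
            }
            dataset
          }

      With this guard, the benchmark's df.where(TRUE) loop never reaches the Analyzer call, which is consistent with the 17 ms vs. 1 ms difference reported above.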
       

          People

            Assignee: Bogdan Raducanu
            Reporter: Bogdan Raducanu
            Votes: 0
            Watchers: 2
