  1. Spark
  2. SPARK-11596

SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.1
    • Fix Version/s: 1.6.0
    • Component/s: SQL
    • Labels: None

    Description

      For nested query plans such as a recursive unionAll, withNewExecutionId is extremely slow, likely because of repeated string concatenation in QueryPlan.simpleString.

      Test case:

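      import org.apache.spark.sql.DataFrame
      import sqlContext.implicits._  // needed for toDF on an RDD of tuples

      // Each iteration unions one more small DataFrame onto the previous result,
      // so the query plan nests one level deeper every time.
      (1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
        println(s"PROCESSING >>>>>>>>>>> $idx")
        val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
        val union = curr.map(_.unionAll(df)).getOrElse(df)
        union.cache()
        // count() goes through withNewExecutionId, which is where the slowdown appears
        println(">>" + union.count)
        //union.show()
        Some(union)
      }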
      

      Stack trace:

      scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
      scala.collection.AbstractIterator.addString(Iterator.scala:1157)
      scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
      scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
      scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
      scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
      org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
      org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
      org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
      org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
      org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
      org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
      scala.collection.immutable.List.foreach(List.scala:318)
      org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
      org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
      org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
      scala.collection.immutable.List.foreach(List.scala:318)
      org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
      org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
      org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
      scala.collection.immutable.List.foreach(List.scala:318)
      org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
      org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
      org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
      org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
      org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
      org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
      org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
      org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
      org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
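
      The trace above shows the whole plan being rendered to a string (toString, treeString, generateTreeString) on every call to withNewExecutionId. The following is a minimal sketch of why that gets expensive for nested plans. It is a toy model and not Spark's TreeNode: the names Node, NestedPlanCost, nestedUnion and totalRendered are hypothetical, and the model assumes one line per node with indentation proportional to depth.

```scala
// Toy model of a query plan node (hypothetical, not Spark's TreeNode).
final case class Node(label: String, children: List[Node]) {
  // Render the whole subtree, one line per node, indented by depth.
  def treeString: String = {
    val sb = new StringBuilder
    def go(n: Node, depth: Int): Unit = {
      sb.append(" " * depth).append(n.label).append('\n')
      n.children.foreach(go(_, depth + 1))
    }
    go(this, 0)
    sb.toString
  }
}

object NestedPlanCost {
  // Mimic the test case: each step wraps the previous plan in a new Union.
  def nestedUnion(steps: Int): Node =
    (2 to steps).foldLeft(Node("Scan", Nil)) { (plan, _) =>
      Node("Union", List(plan, Node("Scan", Nil)))
    }

  // Total characters rendered if every step renders its full plan once,
  // as happens when each action stringifies the plan.
  def totalRendered(steps: Int): Long =
    (1 to steps).map(i => nestedUnion(i).treeString.length.toLong).sum

  def main(args: Array[String]): Unit =
    Seq(10, 100).foreach { n =>
      val last = nestedUnion(n).treeString.length
      println(s"steps=$n lastPlanChars=$last totalChars=${totalRendered(n)}")
    }
}
```

      In this model the rendered size of a single plan grows faster than linearly with nesting depth, and the loop pays that cost again on every iteration. The real plans in the test case also carry argument strings per node, so the actual cost is likely higher than the toy numbers suggest.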

      Attachments

        1. screenshot-1.png
          233 kB
          Yin Huai

          People

            Assignee: Yin Huai (yhuai)
            Reporter: Cristian Opris (copris)