Spark / SPARK-17972

Query planning slows down dramatically for large query plans even when sub-trees are cached

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.2, 2.0.1
    • Fix Version/s: 2.1.0
    • Component/s: SQL
    • Labels: None

    Description

      The following Spark shell snippet creates a series of query plans that grow exponentially. The i-th plan is built by joining the cached (i - 1)-th plan with itself four times.

      (0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
        val start = System.currentTimeMillis()
        val result = plan.join(plan, "value").join(plan, "value").join(plan, "value").join(plan, "value")
        result.cache()
        System.out.println(s"Iteration $iteration takes time ${System.currentTimeMillis() - start} ms")
        result.as[Int]
      }
      

      We can see that although all plans are cached, the query planning time still grows exponentially and quickly becomes unbearable.

      Iteration 0 takes time 9 ms
      Iteration 1 takes time 19 ms
      Iteration 2 takes time 61 ms
      Iteration 3 takes time 219 ms
      Iteration 4 takes time 830 ms
      Iteration 5 takes time 4080 ms
      

      Similar scenarios can be found in iterative ML code, where they significantly affect usability.

      This issue can be fixed by introducing a checkpoint() method for Dataset that truncates both the query plan and the lineage of the underlying RDD.
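
The proposed fix can be sketched as follows. This is a minimal illustration, not the patch itself: it assumes Spark 2.1.0 or later (the fix version listed above), where Dataset.checkpoint() is available, and it assumes a local temporary directory is acceptable as the checkpoint location. The object name CheckpointSketch and the helper run are hypothetical names introduced for this example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SparkSession}

object CheckpointSketch {
  // Same self-join loop as in the description, but each iteration's result is
  // checkpointed instead of cached. Requires a checkpoint directory to be set.
  def run(spark: SparkSession, iterations: Int = 6): Dataset[Int] = {
    import spark.implicits._
    (0 until iterations).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
      val start = System.currentTimeMillis()
      val joined = plan.join(plan, "value").join(plan, "value").join(plan, "value").join(plan, "value")
      // checkpoint() is eager by default: it runs a job, writes the data to the
      // checkpoint directory, and returns a Dataset backed by the checkpointed RDD,
      // so the next iteration starts from a short plan.
      val truncated = joined.checkpoint().as[Int]
      println(s"Iteration $iteration takes time ${System.currentTimeMillis() - start} ms")
      truncated
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("checkpoint-sketch").getOrCreate()
    // Assumption: a local temporary directory is fine for checkpoint files.
    spark.sparkContext.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)
    try run(spark).show() finally spark.stop()
  }
}
```

Because the checkpoint is eager, the time printed per iteration here includes running the job and writing the checkpoint files, so it is not directly comparable to the planning-only timings above. The point of the sketch is that each iteration's input plan no longer carries the full history of earlier joins.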

            People

              Assignee: Cheng Lian
              Reporter: Cheng Lian