Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-6847

Stack overflow on updateStateByKey which followed by a dstream with checkpoint set

    XMLWordPrintableJSON

    Details

    • Target Version/s:

      Description

      The issue happens with the following sample code: uses updateStateByKey followed by a map with checkpoint interval 10 seconds

          val sparkConf = new SparkConf().setAppName("test")
          val streamingContext = new StreamingContext(sparkConf, Seconds(10))
          streamingContext.checkpoint("""checkpoint""")
          val source = streamingContext.socketTextStream("localhost", 9999)
          val updatedResult = source.map(
              (1,_)).updateStateByKey(
                  (newlist : Seq[String], oldstate : Option[String]) =>     newlist.headOption.orElse(oldstate))
          updatedResult.map(_._2)
          .checkpoint(Seconds(10))
          .foreachRDD((rdd, t) => {
            println("Deep: " + rdd.toDebugString.split("\n").length)
            println(t.toString() + ": " + rdd.collect.length)
          })
          streamingContext.start()
          streamingContext.awaitTermination()
      

      From the output, we can see that the dependency will be increasing time over time, the updateStateByKey never get check-pointed, and finally, the stack overflow will happen.

      Note:

      • The rdd in updatedResult.map(_._2) get check-pointed in this case, but not the updateStateByKey
      • If remove the checkpoint(Seconds(10)) from the map result ( updatedResult.map(_._2) ), the stack overflow will not happen

        Attachments

          Activity

            People

            • Assignee:
              zsxwing Shixiong Zhu
              Reporter:
              jhu Jack Hu
            • Votes:
              1 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: