Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-17842

Thread and memory leak in WindowDstream (UnionRDD ) when parallelPartition computation gets enabled.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Duplicate
    • 2.0.0
    • None
    • DStreams, Spark Core
    • None
    • Yarn cluster, Eclipse Dev Env

    Description

      We noticed a steady increase in ForkJoinTask instances in the driver process heap. Found out the root cause to be UnionRDD.

      WindowDstream internally uses UnionRDD which has a parallel partition computation logic by using parallel collection with ForkJoinPool task support.
      partitionEvalTaskSupport =new ForkJoinTaskSupport(new ForkJoinPool(8))

      The pool is created each time when a UnionRDD is created , but the pool is not getting shutdown. This is leaking thread/mem every slide interval of the window.

      Easily reproducible with the below code. Just keep a watch on the number of threads.

          val sparkConf = new SparkConf().setMaster("local[*]").setAppName("TestLeak")
          val ssc = new StreamingContext(sparkConf, Seconds(1))
          ssc.checkpoint("checkpoint")
          val rdd = ssc.sparkContext.parallelize(List(1,2,3))
          val constStream = new ConstantInputDStream[Int](ssc,rdd)
          constStream.window(Seconds(20),Seconds(1)).print()
          ssc.start()
          ssc.awaitTermination();
      

      This happens only when the number of rdds to be unioned is above the value spark.rdd.parallelListingThreshold (By default 10)

      Currently i'm working around by setting this threshold be a higher value.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              sreelalsl Sreelal S L
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: