Spark / SPARK-32735

RDD actions in DStream.transform don't show on the batch page


    Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: DStreams, Web UI
    • Labels: None
    • Docs Text:
      Fix RDD actions in DStream.transform not showing on the batch page

      Description

      Issue

      // ssc is an existing StreamingContext
      val lines = ssc.socketTextStream("localhost", 9999)
      val words = lines.flatMap(_.split(" "))
      val mappedStream = words.transform { rdd =>
        val c = rdd.count() // RDD action inside transform: runs as its own Spark job
        rdd.map(x => s"$c x")
      }
      mappedStream.foreachRDD(rdd => rdd.foreach(x => println(x)))

      Every batch creates two Spark jobs. Only the second one is associated with the streaming output operation and shows up on the batch page.

      Investigation

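      The first action, rdd.count(), is invoked by JobGenerator.generateJobs, while the transform function is being evaluated. At that point the batch time and output op id are not yet available in the Spark context, because JobScheduler sets them later. The job triggered by rdd.count() therefore carries no batch information and cannot be linked to the batch on the batch page.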

      Proposal

      Delegate dstream.getOrCompute to JobScheduler, so that all RDD actions run in the Spark context with the correct local properties.
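
      The ordering problem and the proposed change can be illustrated with a small toy model in plain Scala. This is only a sketch: ToyContext, ToyStreaming, scheduleCurrent, scheduleProposed and the "batchTime" key are hypothetical names invented for illustration. They are not Spark classes or APIs, and the real change in JobScheduler may look quite different.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for SparkContext local properties; none of these names are Spark APIs.
object ToyContext {
  private val props = new ThreadLocal[Map[String, String]] {
    override def initialValue(): Map[String, String] = Map.empty
  }

  // Every "job" is recorded together with the batch time it was tagged with, if any.
  val jobs = ArrayBuffer.empty[(String, Option[String])]

  // Run `body` with one extra property set, restoring the previous state afterwards.
  def withProperty[T](key: String, value: String)(body: => T): T = {
    val saved = props.get()
    props.set(saved + (key -> value))
    try body finally props.set(saved)
  }

  // Stand-in for an RDD action such as count() or foreach().
  def runJob(name: String): Unit =
    jobs += ((name, props.get().get("batchTime")))
}

object ToyStreaming {
  // Generating the output job evaluates the transform function eagerly,
  // so its action runs here; only the output action is deferred.
  def generateJob(): () => Unit = {
    ToyContext.runJob("count in transform")
    () => ToyContext.runJob("foreach in output op")
  }

  // Current ordering: generate first, set properties only around the deferred body.
  def scheduleCurrent(batchTime: Long): Unit = {
    val job = generateJob()
    ToyContext.withProperty("batchTime", batchTime.toString)(job())
  }

  // Proposed ordering: generation is delegated into the scope that sets the properties.
  def scheduleProposed(batchTime: Long): Unit =
    ToyContext.withProperty("batchTime", batchTime.toString) {
      val job = generateJob()
      job()
    }
}
```

      In this model, scheduleCurrent records the "count in transform" job with no batch time and only the "foreach in output op" job with one, which mirrors the behaviour described under Issue. scheduleProposed records both jobs with the batch time, because generation happens inside the scope where the properties are set.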

            People

            • Assignee: Unassigned
            • Reporter: olwn Liechuan Ou