  1. Spark
  2. SPARK-42205

Remove logging of Accumulables in Task/Stage start events in JsonProtocol

Details

    Description

      Event logs written by Spark's JsonProtocol (and read by the History Server) are affected by a race condition when tasks or stages finish very quickly:

      The SparkListenerTaskStart and SparkListenerStageSubmitted events contain mutable TaskInfo and StageInfo objects, which in turn contain Accumulables fields. When a task or stage is submitted, Accumulables is initially empty. When the task or stage finishes, this field is updated with values from the task.

      If a task or stage finishes before the event logging listener has logged its start event, the start event will contain the Accumulable values from the task or stage end event.
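The race can be illustrated with a minimal, deterministic sketch. This is not Spark code: the `TaskInfo` class, the `to_json` helper, the queue standing in for the asynchronous listener bus, and the JSON field names are all simplified assumptions made for illustration. The point is only that the start event holds a reference to a mutable object and is serialized after that object has been updated.

```python
import json
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TaskInfo:
    """Simplified, hypothetical stand-in for Spark's mutable TaskInfo."""
    task_id: int
    accumulables: list = field(default_factory=list)


def to_json(event_name, info):
    # Serialization reads whatever the shared object holds at this moment,
    # not what it held when the event was posted.
    return json.dumps({
        "Event": event_name,
        "Task ID": info.task_id,
        "Accumulables": list(info.accumulables),
    })


def simulate_fast_task():
    queue = deque()  # stands in for the asynchronous listener bus
    info = TaskInfo(task_id=0)

    # 1. The start event is posted while Accumulables is still empty.
    queue.append(("SparkListenerTaskStart", info))
    # 2. The task finishes quickly and the same object is updated in place.
    info.accumulables.append({"Name": "records read", "Value": 100})
    queue.append(("SparkListenerTaskEnd", info))

    # 3. Only now does the logging listener drain the queue and serialize.
    return [to_json(name, queued_info) for name, queued_info in queue]


logged = simulate_fast_task()
```

In this sketch, both entries of `logged` carry the same Accumulables list, even though the list was empty when the start event was posted.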

      This information isn't used by the History Server and contributes to wasteful bloat in event log sizes. In one real-world log, I found that ~10% of the uncompressed log size was due to these redundant Accumulable fields.

      I propose that we update JsonProtocol to skip logging this field for the SparkListenerTaskStart and SparkListenerStageSubmitted events.
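One way to picture the proposal is a serializer that takes a flag controlling whether Accumulables are written, with start events passing `False`. The sketch below is an assumption-laden illustration, not the actual JsonProtocol change: the function names, the `include_accumulables` parameter, and the JSON layout are all hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class TaskInfo:
    """Simplified, hypothetical stand-in for Spark's mutable TaskInfo."""
    task_id: int
    accumulables: list = field(default_factory=list)


def task_info_to_dict(info, include_accumulables):
    # Hypothetical helper: the Accumulables field is written only on request.
    out = {"Task ID": info.task_id}
    if include_accumulables:
        out["Accumulables"] = list(info.accumulables)
    return out


def task_start_to_json(info):
    # Start events never log Accumulables, so a late serialization
    # cannot leak end-of-task values into the start event.
    return json.dumps({
        "Event": "SparkListenerTaskStart",
        "Task Info": task_info_to_dict(info, include_accumulables=False),
    })


def task_end_to_json(info):
    # End events keep the field, so no information is lost from the log.
    return json.dumps({
        "Event": "SparkListenerTaskEnd",
        "Task Info": task_info_to_dict(info, include_accumulables=True),
    })


# Even if the object was already updated, the start event stays small.
finished = TaskInfo(task_id=0, accumulables=[{"Name": "records read", "Value": 100}])
start_json = task_start_to_json(finished)
end_json = task_end_to_json(finished)
```

With this shape, the start event is the same regardless of when it is serialized, and the Accumulable values appear exactly once, in the end event.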

          People

            joshrosen Josh Rosen