  Spark / SPARK-42204

Remove redundant logging of TaskMetrics internal accumulators in JsonProtocol event logs

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      Spark's JsonProtocol event logs (used by the history server) contain redundancy in how TaskMetrics are represented in SparkListenerTaskEnd events:

      • The "Task Metrics" field is a map from metric names to values.
      • Under the hood, each metric is implemented using an accumulator, and those accumulator values are also stored in the `Task Info`.`Accumulables` field. Each Accumulable entry contains the metric value from the task, plus the cumulative "sum-so-far" across the completed tasks in that stage.
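
To make the overlap concrete, here is a minimal Python sketch that parses a hand-written, heavily trimmed SparkListenerTaskEnd event and lists the Accumulables entries that repeat values already present under "Task Metrics". The sample values, the `my.custom.counter` accumulator, the helper name, and the assumption that TaskMetrics accumulators can be recognized by an `internal.metrics.` name prefix are illustrative only and do not come from a real log.

```python
import json

# Hand-written, trimmed event for illustration; all values are invented.
SAMPLE_EVENT = json.loads("""
{
  "Event": "SparkListenerTaskEnd",
  "Task Info": {
    "Task ID": 0,
    "Accumulables": [
      {"ID": 1, "Name": "internal.metrics.executorDeserializeTime",
       "Update": 12, "Value": 57, "Internal": true},
      {"ID": 2, "Name": "internal.metrics.executorRunTime",
       "Update": 340, "Value": 1210, "Internal": true},
      {"ID": 9, "Name": "my.custom.counter",
       "Update": 1, "Value": 4, "Internal": false}
    ]
  },
  "Task Metrics": {
    "Executor Deserialize Time": 12,
    "Executor Run Time": 340
  }
}
""")

# Assumption: TaskMetrics accumulators are identifiable by this name prefix.
METRICS_PREFIX = "internal.metrics."


def internal_metric_accumulables(event):
    """Return the Accumulables entries that back TaskMetrics values."""
    accumulables = event.get("Task Info", {}).get("Accumulables", [])
    return [
        acc for acc in accumulables
        if (acc.get("Name") or "").startswith(METRICS_PREFIX)
    ]


redundant = internal_metric_accumulables(SAMPLE_EVENT)
for acc in redundant:
    # "Update" is the per-task value; "Value" is the stage-level sum so far.
    print(acc["Name"], acc["Update"], acc["Value"])
```

In this sample, the two `internal.metrics.*` entries carry the same per-task values as the "Task Metrics" map, while the user-defined accumulator is not duplicated anywhere else.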

      The Spark History Server doesn't rely on the redundant information in the Accumulables field.

      I believe that this redundancy was introduced back in SPARK-10620 when Spark 1.x's separate TaskMetrics implementation was replaced by the current accumulator-based version.

      I think that we should eliminate this redundancy by skipping JsonProtocol logging of the TaskMetrics accumulators. Although it seems unlikely that third-party code relies on the presence of that redundant information, we should hedge by adding an internal configuration flag to re-enable the redundant logging if needed.
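
The proposal can be sketched as a filter guarded by a flag. The Python below is only an illustration applied to an already-parsed event dict, not the JsonProtocol change itself; the function name, the `include_metric_accumulators` parameter (standing in for the internal configuration flag), the `internal.metrics.` prefix check, and the sample event are all hypothetical.

```python
import copy

# Assumption: TaskMetrics accumulators are identifiable by this name prefix.
METRICS_PREFIX = "internal.metrics."


def drop_metric_accumulables(event, include_metric_accumulators=False):
    """Return the event without TaskMetrics-backed Accumulables entries.

    `include_metric_accumulators` is a hypothetical stand-in for the
    proposed flag: when True, the event is returned unchanged.
    """
    if include_metric_accumulators or event.get("Event") != "SparkListenerTaskEnd":
        return event
    slimmed = copy.deepcopy(event)  # leave the caller's event untouched
    task_info = slimmed.get("Task Info")
    if isinstance(task_info, dict) and "Accumulables" in task_info:
        task_info["Accumulables"] = [
            acc for acc in task_info["Accumulables"]
            if not (acc.get("Name") or "").startswith(METRICS_PREFIX)
        ]
    return slimmed


# Invented sample event for illustration.
event = {
    "Event": "SparkListenerTaskEnd",
    "Task Info": {
        "Accumulables": [
            {"ID": 2, "Name": "internal.metrics.executorRunTime",
             "Update": 340, "Value": 1210},
            {"ID": 9, "Name": "my.custom.counter", "Update": 1, "Value": 4},
        ]
    },
    "Task Metrics": {"Executor Run Time": 340},
}

slim = drop_metric_accumulables(event)
legacy = drop_metric_accumulables(event, include_metric_accumulators=True)
print(len(slim["Task Info"]["Accumulables"]),
      len(legacy["Task Info"]["Accumulables"]))
```

The "Task Metrics" map is left alone in both modes, so a consumer that reads metrics from that field sees no difference; only code that reads the metric accumulators out of Accumulables would need the flag.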

          People

            Assignee: Josh Rosen (joshrosen)
            Reporter: Josh Rosen (joshrosen)