SPARK-42204

Remove redundant logging of TaskMetrics internal accumulators in JsonProtocol event logs

Details

    Description

      Spark's JsonProtocol event logs (used by the history server) contain redundancy in how TaskMetrics are represented in SparkListenerTaskEnd events:

      • The "Task Metrics" field is a map from metric names to values.
      • Under the hood, each metric is implemented as an accumulator, and those accumulator values are stored a second time in the `Task Info`.`Accumulables` field. Each Accumulables entry contains the metric's value for the task, plus the cumulative "sum-so-far" across the tasks completed so far in that stage.

      The Spark History Server doesn't rely on the redundant information in the Accumulables field.
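
As a rough illustration of the redundancy described above, the following Python sketch parses a heavily trimmed, made-up SparkListenerTaskEnd event and picks out the Accumulables entries that repeat a task metric. The sample values, the `internal.metrics.` name prefix, and the `redundant_updates` helper are assumptions for illustration; real events carry many more fields.

```python
import json

# Hypothetical, heavily trimmed event; the numbers are invented for illustration.
event = json.loads("""
{
  "Event": "SparkListenerTaskEnd",
  "Task Info": {
    "Task ID": 7,
    "Accumulables": [
      {"ID": 3, "Name": "internal.metrics.executorRunTime",
       "Update": 120, "Value": 950, "Internal": true}
    ]
  },
  "Task Metrics": {"Executor Run Time": 120}
}
""")

# Assumption: accumulators that back TaskMetrics share this name prefix.
METRICS_PREFIX = "internal.metrics."


def redundant_updates(task_end_event):
    """Map each metric-backing accumulator name to its per-task update."""
    accumulables = task_end_event["Task Info"]["Accumulables"]
    return {
        acc["Name"]: acc["Update"]
        for acc in accumulables
        if str(acc.get("Name") or "").startswith(METRICS_PREFIX)
    }


duplicates = redundant_updates(event)
```

In this sample, the per-task `Update` of the accumulator matches the `Executor Run Time` entry under "Task Metrics", while `Value` is the running stage-level sum.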

      I believe that this redundancy was introduced back in SPARK-10620 when Spark 1.x's separate TaskMetrics implementation was replaced by the current accumulator-based version.

      I think that we should eliminate this redundancy by skipping JsonProtocol logging of the TaskMetrics accumulators. Although it's somewhat unlikely that third-party code relies on that redundant information, we should hedge by adding an internal configuration flag to re-enable the redundant logging if needed.
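
The proposed behavior could be sketched roughly as follows. This is a Python illustration only, not Spark's actual JsonProtocol code: the `accumulables_to_log` function, the `include_task_metrics_accumulators` flag name, and the `internal.metrics.` prefix check are all hypothetical stand-ins for whatever the real change uses.

```python
# Assumption: accumulators that back TaskMetrics share this name prefix.
METRICS_PREFIX = "internal.metrics."


def accumulables_to_log(accumulables, include_task_metrics_accumulators=False):
    """Return the accumulables that would be written to the event log.

    By default the metric-backing accumulators are skipped; the hypothetical
    flag restores the old, redundant output for consumers that depend on it.
    """
    if include_task_metrics_accumulators:
        return list(accumulables)
    return [
        acc
        for acc in accumulables
        if not str(acc.get("Name") or "").startswith(METRICS_PREFIX)
    ]


# Made-up sample: one metric-backing accumulator and one user accumulator.
sample = [
    {"ID": 3, "Name": "internal.metrics.executorRunTime", "Update": 120, "Value": 950},
    {"ID": 42, "Name": "rowsParsed", "Update": 10, "Value": 85},
]

default_output = accumulables_to_log(sample)
legacy_output = accumulables_to_log(sample, include_task_metrics_accumulators=True)
```

With the flag off, only the user-defined accumulator remains in the logged list; with it on, the output is unchanged from today's behavior.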

          People

            joshrosen Josh Rosen