Spark / SPARK-29562

SQLAppStatusListener metrics aggregation is slow and memory hungry



    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.4
    • Fix Version/s: 3.0.0
    • Component/s: SQL


      While SQLAppStatusListener was added in 2.3, its aggregation code is very similar to the code it replaced, so I'm sure this problem is even older than that.

      Long story short, the aggregation code (SQLAppStatusListener.aggregateMetrics) is very, very slow. With large queries it can take a non-trivial amount of time, and it also uses a ton of memory.

      There are also cascading issues caused by that: since it's called from an event handler, it can slow down event processing, causing events to be dropped, which can cause listeners to miss important events that would tell them to free up internal state (and, thus, memory).
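To illustrate the cascade described above, here is a minimal, self-contained simulation. It is only a sketch under stated assumptions: `BoundedBus`, `StageTracker` and `simulate` are hypothetical names made up for this example, not Spark classes, and the model (a fixed-capacity queue that drops new events when full, and a listener that handles a fixed number of events per step) is a simplification of how a slow handler behind a bounded queue behaves.

```python
from collections import deque


class BoundedBus:
    """Toy event queue that drops new events when full (hypothetical, not Spark's API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0

    def post(self, event):
        if len(self.queue) >= self.capacity:
            self.dropped += 1  # queue is full: the event is lost
        else:
            self.queue.append(event)

    def drain(self, listener, max_events):
        # Deliver at most max_events queued events to the listener.
        for _ in range(min(max_events, len(self.queue))):
            listener.on_event(self.queue.popleft())


class StageTracker:
    """Toy listener that keeps state per live stage and frees it on 'end'."""

    def __init__(self):
        self.live_stages = set()

    def on_event(self, event):
        kind, stage_id = event
        if kind == "start":
            self.live_stages.add(stage_id)
        elif kind == "end":
            self.live_stages.discard(stage_id)


def simulate(num_stages, capacity, events_handled_per_stage):
    """Return (dropped events, stages still tracked after every stage has ended)."""
    bus = BoundedBus(capacity)
    tracker = StageTracker()
    for stage_id in range(num_stages):
        bus.post(("start", stage_id))
        bus.post(("end", stage_id))
        # A slow handler gets through fewer events per stage than are posted.
        bus.drain(tracker, events_handled_per_stage)
    bus.drain(tracker, len(bus.queue))  # let the listener catch up at the end
    return bus.dropped, len(tracker.live_stages)


if __name__ == "__main__":
    print("fast handler:", simulate(100, capacity=4, events_handled_per_stage=2))
    print("slow handler:", simulate(100, capacity=4, events_handled_per_stage=1))
```

In this toy model, the handler that keeps up drops nothing and ends with no tracked stages. The slow handler loses "end" events once the queue fills, so the listener keeps state for stages that have already finished, even after it has caught up with the queue.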

      To give an anecdotal example, one app I looked at ran into the "events being dropped" issue, which caused the listener to accumulate state for hundreds of live stages, even though most of them had already finished. That led to a few GB of memory being wasted on finished stages that were still being tracked.

      Here, though, I'd like to focus on SQLAppStatusListener.aggregateMetrics and making it faster. We should look at the other issues (unblocking event processing, cleaning up of stale data in listeners) separately.

      (I also remember someone trying to fix something in this area in the past, but I couldn't find a PR or an open bug.)


              • Assignee:
                vanzin Marcelo Masiero Vanzin


                • Created: