Description
Originally reported by Richard Marscher on the dev list:
We have been experiencing issues in production over the past couple of weeks with Spark Standalone Worker JVMs that appear to have memory leaks. They accumulate Old Gen until it reaches its maximum, then enter a failed state that starts critically failing some applications running against the cluster.
I've done some exploration of the Spark code base related to the Worker in search of potential sources of this problem and am looking for some commentary on a couple of theories I have:
Observation 1: The `finishedExecutors` HashMap seems to only accumulate new entries over time, unbounded. It only ever gets appended to and is never periodically purged or cleaned of older executors by something like the worker cleanup scheduler. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L473
It looks like the `finishedExecutors` map is only read when rendering the Worker web UI and when constructing REST API responses. I think we could address this leak by adding a configuration to cap the maximum number of retained executors, applications, etc. We already have similar caps in the driver UI. If we add this configuration, I think we should pick a sensible default value rather than an unlimited one. This is technically a user-facing behavior change, but I think it's acceptable since the current behavior is to crash / OOM.
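As a rough illustration of what bounded retention could look like, here is a minimal sketch. The `FinishedExecutorRetention` class, `FinishedExecutorInfo` placeholder, and the hard-coded cap are assumptions for the example; in the Worker itself the cap would come from the new configuration (for example, something like `spark.worker.ui.retainedExecutors`) and the map would hold the Worker's own executor bookkeeping objects.

```scala
import scala.collection.mutable.LinkedHashMap

/** Placeholder for whatever per-executor data the Worker retains for its UI. */
case class FinishedExecutorInfo(fullId: String, state: String)

/**
 * Bounded-retention sketch: keeps at most `retainedExecutors` finished
 * executors, evicting the oldest entries first.
 */
class FinishedExecutorRetention(retainedExecutors: Int) {
  // Insertion-ordered map, so the oldest finished executors can be dropped first.
  val finishedExecutors = LinkedHashMap[String, FinishedExecutorInfo]()

  /** Record a finished executor and trim the map back down to the cap. */
  def addFinishedExecutor(info: FinishedExecutorInfo): Unit = {
    finishedExecutors(info.fullId) = info
    trimIfNecessary()
  }

  /** Drop the oldest entries once the cap is exceeded. */
  private def trimIfNecessary(): Unit = {
    if (finishedExecutors.size > retainedExecutors) {
      val toRemove = finishedExecutors.keysIterator
        .take(finishedExecutors.size - retainedExecutors).toList
      toRemove.foreach(finishedExecutors.remove)
    }
  }
}
```

Using an insertion-ordered map keeps the eviction logic trivial and mirrors the "drop the oldest entries once the cap is exceeded" approach the driver UI already takes for retained jobs and stages.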
To fix this, we should:
- Add the new configurations to cap how much data we retain.
- Add a stress-tester to spark-perf so that we have a way to reproduce these leaks during QA.
- Add some unit tests to ensure that cleanup is performed in the right places (see the sketch after this list). These tests should be modeled after the memory-leak-prevention strategies that we employed in JobProgressListener and in other parts of the driver UI.
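For the unit-test item, a test could exercise the hypothetical retention sketch above directly: register many more finished executors than the cap allows and assert that only the newest ones survive. The suite name and plain assertions below are illustrative; the real tests would live alongside the Worker's existing suites and use whatever framework they already use.

```scala
/**
 * Sketch of a retention test, in the spirit of the driver-side listener tests:
 * insert more finished executors than the cap allows and check that the map
 * stays bounded and that the oldest entries were the ones evicted.
 */
object FinishedExecutorRetentionSuite {
  def main(args: Array[String]): Unit = {
    val cap = 100
    val retention = new FinishedExecutorRetention(cap)

    // Register far more finished executors than the cap allows.
    (1 to 1000).foreach { i =>
      retention.addFinishedExecutor(FinishedExecutorInfo(s"app-1/$i", "EXITED"))
    }

    // The map must be bounded by the cap...
    assert(retention.finishedExecutors.size == cap)
    // ...and it should be the oldest entries that were evicted.
    assert(!retention.finishedExecutors.contains("app-1/1"))
    assert(retention.finishedExecutors.contains("app-1/1000"))
    println("retention cap respected")
  }
}
```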