Details
- Umbrella
- Status: Resolved
- Major
- Resolution: Fixed
- 2.3.3
- None
Description
After our Spark system has been under load for an extended period, GC stays highly active and the jobs page remains unresponsive even after the load eases. See the attached GCRateIssues for a detailed definition of the problem.
We found a number of separate issues, which are detailed in the subtasks. I anticipate submitting a single PR covering all subtasks, with commits that roughly align with the subtask descriptions.
We measured the performance of the code before and after the change; the results are in the attached document PerformanceBeforeAndAfter. tl;dr: in our use case, we saw an improvement of about five (computed) orders of magnitude.
Attachments
Issue Links
- is duplicated by
  - SPARK-27727 Asynchronous ElementStore cleanup should have only one pending cleanup per class (Closed)
  - SPARK-27728 Address thread-safety of InMemoryStore and ElementTrackingStores. (Closed)
  - SPARK-27729 Extract deletion of the summaries from the stage deletion loop (Closed)
  - SPARK-27730 Add support for removeAllKeys (Closed)
  - SPARK-27731 Cleanup some non-compile time type checking and exception handling (Closed)
- links to
1. Asynchronous ElementStore cleanup should have only one pending cleanup per class | Closed | Unassigned
2. Address thread-safety of InMemoryStore and ElementTrackingStores. | Closed | Unassigned
3. Extract deletion of the summaries from the stage deletion loop | Closed | Unassigned
4. Add support for removeAllKeys | Closed | Unassigned
5. Cleanup some non-compile time type checking and exception handling | Closed | Unassigned
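
For readers skimming the subtasks, the idea behind "only one pending cleanup per class" can be sketched roughly as below. This is an illustrative sketch only, not the code from the PR: the class name `CoalescingCleaner`, the method `requestCleanup`, and the use of a per-class `AtomicBoolean` flag are all assumptions made for the example. The point it shows is that repeated cleanup requests for the same element class collapse into a single queued task, so a busy store does not flood the cleanup executor.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper (not Spark's actual code): coalesces cleanup requests per class. */
final class CoalescingCleaner {
    private final Executor executor;
    // One flag per element class; true means a cleanup is already queued.
    private final ConcurrentHashMap<Class<?>, AtomicBoolean> pending =
        new ConcurrentHashMap<>();

    CoalescingCleaner(Executor executor) {
        this.executor = executor;
    }

    /** Returns true if a cleanup was scheduled, false if one was already pending. */
    boolean requestCleanup(Class<?> klass, Runnable cleanup) {
        AtomicBoolean flag =
            pending.computeIfAbsent(klass, k -> new AtomicBoolean(false));
        if (!flag.compareAndSet(false, true)) {
            return false; // a cleanup for this class is already queued; drop the duplicate
        }
        try {
            executor.execute(() -> {
                // Clear the flag before running, so writes that arrive while the
                // cleanup is in progress can still schedule a follow-up pass.
                flag.set(false);
                cleanup.run();
            });
        } catch (RuntimeException e) {
            flag.set(false); // the task was rejected, so nothing is pending
            throw e;
        }
        return true;
    }
}
```

With a single-threaded executor, calling `requestCleanup(StageData.class, task)` many times in a burst would enqueue at most one task for that class until it starts running (`StageData` here is just a placeholder for any element class). Clearing the flag before the cleanup body runs is a deliberate choice in this sketch: it may occasionally allow one extra pass, but it avoids missing elements written while a cleanup is underway.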