I am running Spark Streaming (1.3.1) as a long-running service, and it runs out of memory after running for 7 days.
I found that the field authorizedCommittersByStage in the OutputCommitCoordinator class causes the OOM.
authorizedCommittersByStage is a map whose key is a StageId and whose value is a Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method, stageEnd, that removes a stageId from authorizedCommittersByStage. However, stageEnd is never called by DAGScheduler. As a result, the per-stage entries in authorizedCommittersByStage are never cleaned up, so the map grows with every stage that runs, which eventually causes the OOM.
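To illustrate the growth pattern, here is a minimal sketch in plain Java. It is a simplified model, not Spark's actual source: the class name CommitCoordinatorSketch, the stageStart and trackedStages methods, and the use of Integer/Long for the IDs are all assumptions made for the example. Only the nested map shape and the role of stageEnd come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified, hypothetical model of the bookkeeping described above (not Spark's code). */
public class CommitCoordinatorSketch {
    // stageId -> (partitionId -> taskAttemptId), mirroring the shape of authorizedCommittersByStage
    private final Map<Integer, Map<Integer, Long>> authorizedCommittersByStage = new HashMap<>();

    /** Assumed hook: registers an empty entry when a stage begins. */
    public void stageStart(int stageId) {
        authorizedCommittersByStage.put(stageId, new HashMap<>());
    }

    /** The first attempt to ask for a partition becomes its authorized committer. */
    public boolean canCommit(int stageId, int partitionId, long attemptId) {
        Map<Integer, Long> committers = authorizedCommittersByStage.get(stageId);
        if (committers == null) {
            return false; // unknown or already cleaned-up stage
        }
        Long existing = committers.putIfAbsent(partitionId, attemptId);
        return existing == null || existing == attemptId;
    }

    /** Removes the stage's entry; if this is never called, entries accumulate forever. */
    public void stageEnd(int stageId) {
        authorizedCommittersByStage.remove(stageId);
    }

    /** Number of stages still held in memory. */
    public int trackedStages() {
        return authorizedCommittersByStage.size();
    }

    public static void main(String[] args) {
        CommitCoordinatorSketch leaky = new CommitCoordinatorSketch();
        CommitCoordinatorSketch cleaned = new CommitCoordinatorSketch();
        // A streaming job creates new stages for every batch; simulate 1000 of them.
        for (int stageId = 0; stageId < 1000; stageId++) {
            leaky.stageStart(stageId);
            leaky.canCommit(stageId, 0, 1L);
            // stageEnd is never called on "leaky", as in the reported behavior

            cleaned.stageStart(stageId);
            cleaned.canCommit(stageId, 0, 1L);
            cleaned.stageEnd(stageId);
        }
        System.out.println("without stageEnd: " + leaky.trackedStages()); // prints 1000
        System.out.println("with stageEnd: " + cleaned.trackedStages());  // prints 0
    }
}
```

In a batch job the leaked entries disappear when the application exits, but a streaming job keeps creating stages for as long as it runs, which would be consistent with the OOM only showing up after several days.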
This happens in my Spark Streaming program (1.3.1). I am not sure whether it also affects other Spark components or other Spark versions.