Details
Bug
Status: Closed
Major
Resolution: Fixed
1.0.0-Ducc
None
Description
A Job with 300,000+ total work items failed with Reason Premature after processing 70,000+ of them.
The Job Driver (JD) maintains a file named work-item-status.json.gz in the user's log directory. It contains the information shown by the WebServer on the Work Items tab of the Job Details page. As each work item is processed, the JD's WorkItemStateManager (WiSm) maintains an in-memory representation of its Id, Node, PID, State, and Start and End times. Periodically, the JD calls the WiSm's export method to rewrite this file.
Although each work item needs only a small amount of information, the total memory consumed becomes large when the number of work items is large, because these in-memory representations are kept for the lifetime of the Job.
To alleviate this "designed-in" memory leak, the WiSm should keep only active work items in memory and flush terminal work items to disk. This change will affect the DUCC components that employ the WiSm, specifically the CLI, WS and JD.
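The proposed behavior could look roughly like the following minimal Java sketch. It is illustrative only and is not the actual WiSm code: the class name ActiveWorkItemStore, its methods, the State values, and the tab-separated append-only file are all assumptions made for the example (the real file is gzipped JSON). The point it shows is that a work item leaves the in-memory map as soon as it reaches a terminal state, so memory use tracks the number of active items rather than the total.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: not the actual DUCC WorkItemStateManager API.
class ActiveWorkItemStore {

    // Illustrative states; the real WiSm state names may differ.
    enum State { OPERATING, ENDED, ERROR }

    // The per-work-item fields named in the description.
    static final class WorkItem {
        final String id;
        final String node;
        final String pid;
        final long start;
        State state = State.OPERATING;
        long end;

        WorkItem(String id, String node, String pid, long start) {
            this.id = id;
            this.node = node;
            this.pid = pid;
            this.start = start;
        }

        // One tab-separated record per work item (assumed format).
        String toLine() {
            return String.join("\t", id, node, pid, state.name(),
                    Long.toString(start), Long.toString(end));
        }
    }

    // Only work items that are still active are held in memory.
    private final Map<String, WorkItem> active = new HashMap<>();
    private final Path terminalFile;
    private long flushed;

    ActiveWorkItemStore(Path terminalFile) {
        this.terminalFile = terminalFile;
    }

    synchronized void start(String id, String node, String pid, long now) {
        active.put(id, new WorkItem(id, node, pid, now));
    }

    // Append the finished item to disk, then drop it from memory.
    synchronized void finish(String id, State terminalState, long now) {
        if (terminalState == State.OPERATING) {
            throw new IllegalArgumentException("not a terminal state");
        }
        WorkItem wi = active.get(id);
        if (wi == null) {
            return; // unknown or already flushed
        }
        wi.state = terminalState;
        wi.end = now;
        byte[] record = (wi.toLine() + System.lineSeparator())
                .getBytes(StandardCharsets.UTF_8);
        try {
            Files.write(terminalFile, record,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        active.remove(id); // removed only after the write succeeded
        flushed++;
    }

    synchronized int activeCount() { return active.size(); }

    synchronized long flushedCount() { return flushed; }
}
```

With this shape, readers such as the CLI and WS would need to combine the on-disk terminal records with the current active set, which is why the change touches every component that uses the WiSm.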