Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
If installing a Ratis snapshot fails after the metadata manager has been stopped, then as part of the flow we call OzoneManager#reloadOMState() from OzoneManager#installCheckpoint.
As part of the reloadOMState() call, we re-create the IAccessAuthorizer instance with the help of OzoneAuthorizerFactory. The old instance may remain running, however: even though we replace the reference in OzoneManager, the old object may still be referenced from other places.
If the authorizer object refers to a significant amount of data, a repeating failure like the one described in HDDS-10300 can fill up the heap of Ozone Manager. Internally we ran into this with an Atlas+Ranger+Ozone setup, where the plugin referred to ~2GB worth of objects in the heap and was present multiple times in a heap dump. This condition led to a crash due to long GC pauses.
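The retention pattern described above can be illustrated with a minimal, self-contained sketch. It does not use Ozone code: `FakeAuthorizer`, `FakeManager`, `reloadLeaky`, and `reloadWithClose` are hypothetical names invented for illustration, and closing the old instance before dropping it is only one possible mitigation, not necessarily what the actual fix does.

```java
import java.util.ArrayList;
import java.util.List;

public class AuthorizerReloadSketch {

  /** Hypothetical stand-in for an authorizer plugin that holds heavy state. */
  public static class FakeAuthorizer implements AutoCloseable {
    private boolean running = true;

    public boolean isRunning() {
      return running;
    }

    @Override
    public void close() {
      // A real plugin would stop its threads and release its caches here.
      running = false;
    }
  }

  /** Hypothetical stand-in for the component that owns the authorizer. */
  public static class FakeManager {
    private FakeAuthorizer authorizer = new FakeAuthorizer();

    // Simulates "other places" that captured a reference to each instance.
    private final List<FakeAuthorizer> externalRefs = new ArrayList<>();

    public FakeManager() {
      externalRefs.add(authorizer);
    }

    /** Replaces the reference only; the old instance keeps running. */
    public void reloadLeaky() {
      authorizer = new FakeAuthorizer();
      externalRefs.add(authorizer);
    }

    /** Replaces the reference and explicitly stops the old instance. */
    public void reloadWithClose() {
      FakeAuthorizer old = authorizer;
      authorizer = new FakeAuthorizer();
      externalRefs.add(authorizer);
      old.close();
    }

    /** Counts instances that are still reachable and still running. */
    public long runningInstances() {
      return externalRefs.stream().filter(FakeAuthorizer::isRunning).count();
    }
  }

  public static void main(String[] args) {
    FakeManager leaky = new FakeManager();
    FakeManager closing = new FakeManager();
    // Simulate three repeated failures, each triggering a reload.
    for (int i = 0; i < 3; i++) {
      leaky.reloadLeaky();
      closing.reloadWithClose();
    }
    System.out.println("leaky reload, running instances: " + leaky.runningInstances());
    System.out.println("closing reload, running instances: " + closing.runningInstances());
  }
}
```

In this sketch each leaky reload leaves one more running instance reachable, which mirrors how a repeating snapshot-install failure could multiply a ~2GB plugin in the heap.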