Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
In the current snapshot LRU cache implementation, evicting the oldest snapshot closes the corresponding RocksDB instance: https://github.com/apache/ozone/blob/3f7ded2a34c0c35b89901e222ceaee0d1fdf08b6/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java#L124
That is probably fine for short-lived tasks such as users reading snapshots, but it is probably not safe for long-lived tasks such as snapshot diff and possibly snapshot delete.
The problem is that a cache entry is currently refreshed only when the snapshot is initially retrieved from the cache; subsequent reads from the snapshot itself do not refresh it. A RocksDB instance can therefore be evicted and closed in the middle of snapshot diff processing.
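The failure mode can be reproduced in miniature. The sketch below is not the OmSnapshotManager code: it uses java.util.LinkedHashMap in access order as a stand-in for the real cache, and FakeSnapshotDb, ClosingLruCache and LruEvictionDemo are hypothetical names invented for illustration. It only shows that a handle obtained once from the cache can be closed underneath its holder when later insertions evict it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for an open snapshot DB handle. */
class FakeSnapshotDb implements AutoCloseable {
  private final String name;
  private boolean closed;

  FakeSnapshotDb(String name) { this.name = name; }

  /** Reads go straight to the handle; the cache never sees them. */
  String read(String key) {
    if (closed) {
      throw new IllegalStateException(name + " is closed");
    }
    return name + ":" + key;
  }

  boolean isClosed() { return closed; }

  @Override
  public void close() { closed = true; }
}

/** Access-ordered LRU map that closes the eldest entry when it is evicted. */
class ClosingLruCache extends LinkedHashMap<String, FakeSnapshotDb> {
  private final int capacity;

  ClosingLruCache(int capacity) {
    super(16, 0.75f, true); // true = access order, so get() refreshes an entry
    this.capacity = capacity;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, FakeSnapshotDb> eldest) {
    if (size() > capacity) {
      eldest.getValue().close(); // eviction closes the handle, in use or not
      return true;
    }
    return false;
  }
}

class LruEvictionDemo {
  /** Returns true if the long-running task's handle was closed by eviction. */
  static boolean handleClosedAfterEviction() {
    ClosingLruCache cache = new ClosingLruCache(2);
    cache.put("snap1", new FakeSnapshotDb("snap1"));

    // A long-running task fetches the handle once, then reads from it directly.
    FakeSnapshotDb diffHandle = cache.get("snap1");
    diffHandle.read("key1"); // does not refresh the cache entry

    // Other activity fills the cache and pushes snap1 out.
    cache.put("snap2", new FakeSnapshotDb("snap2"));
    cache.put("snap3", new FakeSnapshotDb("snap3"));

    return diffHandle.isClosed();
  }
}
```

After the two extra insertions, any further diffHandle.read(...) call in this sketch would throw, which mirrors a long-running task hitting a closed instance.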
One alternative I can think of is to add a reference counting scheme so that RocksDB instances that are still in use aren't closed automatically on eviction.
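A minimal sketch of what such a reference counting scheme might look like, assuming the cache's eviction hook could call markEvicted() instead of closing the instance directly. RefCountedHandle and its method names are hypothetical and are not taken from the Ozone code base; the wrapped resource is modelled as a plain java.io.Closeable.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

/**
 * Hypothetical wrapper: the resource is closed only once it has been
 * evicted AND no caller still holds a reference to it.
 */
class RefCountedHandle<T extends Closeable> {
  private final T resource;
  private int refCount;
  private boolean evicted;
  private boolean closed;

  RefCountedHandle(T resource) { this.resource = resource; }

  /** Called by a task before it starts using the resource. */
  synchronized T acquire() {
    if (closed) {
      throw new IllegalStateException("resource already closed");
    }
    refCount++;
    return resource;
  }

  /** Called by a task when it is done; may trigger the deferred close. */
  synchronized void release() {
    if (refCount == 0) {
      throw new IllegalStateException("release without matching acquire");
    }
    refCount--;
    closeIfUnused();
  }

  /** Called from the eviction hook in place of an immediate close. */
  synchronized void markEvicted() {
    evicted = true;
    closeIfUnused();
  }

  synchronized boolean isClosed() { return closed; }

  private void closeIfUnused() {
    if (evicted && refCount == 0 && !closed) {
      closed = true;
      try {
        resource.close();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
  }
}
```

With this shape, a short-lived reader and a long-running diff would both bracket their work with acquire() and release(), and eviction alone would no longer close an instance that is in use.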
Another possibility is to have an entirely separate pool of snapshot entries, outside of the cache, that long-running tasks like snapshot diff open and close explicitly.
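The separate-pool idea could be sketched roughly as follows. ExplicitSnapshotPool, Lease and the opener function are hypothetical names, and the sketch assumes that a second handle to the same snapshot can be opened independently of the cache, which would need to be verified for RocksDB. The point is only that nothing in the pool evicts: a handle lives exactly as long as the task's lease.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical pool with no eviction: handles are closed only by their owner. */
class ExplicitSnapshotPool {

  /** Lease held by a long-running task; closing it closes the handle. */
  final class Lease implements AutoCloseable {
    private final Closeable handle;

    private Lease(Closeable handle) { this.handle = handle; }

    Closeable handle() { return handle; }

    @Override
    public void close() {
      // remove() returns false on a repeated close, making close() idempotent.
      if (leases.remove(this)) {
        try {
          handle.close();
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      }
    }
  }

  // Assumed factory that opens a handle for a snapshot name.
  private final Function<String, Closeable> opener;
  private final Set<Lease> leases = ConcurrentHashMap.newKeySet();

  ExplicitSnapshotPool(Function<String, Closeable> opener) {
    this.opener = opener;
  }

  /** Opens a dedicated handle for the caller, bypassing the LRU cache. */
  Lease open(String snapshotName) {
    Lease lease = new Lease(opener.apply(snapshotName));
    leases.add(lease);
    return lease;
  }

  int openCount() { return leases.size(); }
}
```

A long-running task would wrap its work in try (ExplicitSnapshotPool.Lease lease = pool.open(name)) { ... }, so the handle is released even if the task fails partway through.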
Issue Links
- depends upon: HDDS-7914 [Snapshot] Block FS API access to deleted (non-active) snapshots (Resolved)