[HDDS-7935] [Snapshot] LRU Cache entries may get evicted/closed during long running processes - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.4.0
Component/s: None
Labels:
- pull-request-available

Description

In the current snapshot LRU cache implementation, evicting the least recently used snapshot also closes the corresponding RocksDB instance: https://github.com/apache/ozone/blob/3f7ded2a34c0c35b89901e222ceaee0d1fdf08b6/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java#L124

That is probably fine for short-lived tasks such as users reading snapshots, but it is probably not safe for long-lived tasks such as snapshot diff and possibly snapshot delete.

The problem is that a cache entry is currently refreshed only when the snapshot is initially retrieved from the cache; subsequent reads from the snapshot itself do not refresh it. A RocksDB instance can therefore be evicted and closed in the middle of snapshot diff processing.
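The failure mode can be reproduced in miniature. The sketch below is not the OmSnapshotManager code: `FakeSnapshotDb`, `ClosingLruCache`, and `EvictionDemo` are hypothetical stand-ins built on `java.util.LinkedHashMap`, assuming only the behavior described above (a close on eviction, and recency that is updated by cache lookups but not by reads on the returned handle).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for an open snapshot DB handle (not the real Ozone/RocksDB type). */
class FakeSnapshotDb implements AutoCloseable {
    private final String name;
    private volatile boolean closed;

    FakeSnapshotDb(String name) { this.name = name; }

    /** Reads go straight to the handle and never touch the cache. */
    String read(String key) {
        if (closed) {
            throw new IllegalStateException(name + " is closed");
        }
        return name + ":" + key;
    }

    boolean isClosed() { return closed; }

    @Override
    public void close() { closed = true; }
}

/** Hypothetical LRU cache that closes a handle as soon as it is evicted. */
class ClosingLruCache extends LinkedHashMap<String, FakeSnapshotDb> {
    private final int capacity;

    ClosingLruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: only get()/put() refresh recency
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, FakeSnapshotDb> eldest) {
        if (size() > capacity) {
            eldest.getValue().close(); // closed even if a task still holds the handle
            return true;
        }
        return false;
    }
}

class EvictionDemo {
    public static void main(String[] args) {
        ClosingLruCache cache = new ClosingLruCache(2);
        cache.put("snap1", new FakeSnapshotDb("snap1"));

        // A long-running task looks the snapshot up once and keeps the handle.
        FakeSnapshotDb held = cache.get("snap1");
        held.read("k1"); // works, but does not refresh the cache entry

        // Other requests fill the cache and push snap1 out.
        cache.put("snap2", new FakeSnapshotDb("snap2"));
        cache.put("snap3", new FakeSnapshotDb("snap3"));

        try {
            held.read("k2");
        } catch (IllegalStateException e) {
            System.out.println("read failed: " + e.getMessage());
        }
    }
}
```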

One alternative is to add a reference-counting scheme so that RocksDB instances are not closed automatically on eviction.
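A minimal sketch of what such a scheme could look like, under assumptions: `RefCountedHandle` and its methods `acquire`, `release`, and `markEvicted` are hypothetical names, not part of Ozone or of the eventual fix. The idea is that eviction only marks the entry, and the underlying resource is closed once it is both evicted and no longer referenced.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical wrapper: the resource is closed only when evicted AND unreferenced. */
class RefCountedHandle<T extends Closeable> {
    private final T resource;
    private int refCount;
    private boolean evicted;
    private boolean closed;

    RefCountedHandle(T resource) { this.resource = resource; }

    /** Called by a task before it starts using the resource. */
    synchronized T acquire() {
        if (closed) {
            throw new IllegalStateException("handle already closed");
        }
        refCount++;
        return resource;
    }

    /** Called by a task when it is done, ideally from a finally block. */
    synchronized void release() {
        if (refCount == 0) {
            throw new IllegalStateException("release without matching acquire");
        }
        refCount--;
        closeIfUnused();
    }

    /** Called by the cache on eviction instead of closing the resource directly. */
    synchronized void markEvicted() {
        evicted = true;
        closeIfUnused();
    }

    synchronized boolean isClosed() { return closed; }

    // Only invoked from synchronized methods, so the fields are read consistently.
    private void closeIfUnused() {
        if (evicted && refCount == 0 && !closed) {
            closed = true;
            try {
                resource.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

With this shape, a snapshot diff task would call `acquire()` once at the start and `release()` at the end; an eviction in between would defer the close until the release.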

Another possibility is to have an entirely separate pool of snapshot entries, outside of the cache, that are explicitly opened and closed by long-running tasks such as snapshot diff.
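The separate-pool idea could be sketched roughly as follows. Everything here is an assumption made for illustration: `ExplicitSnapshotPool`, its `open`/`close` methods, and the `opener` function are hypothetical, and sharing one instance per snapshot key among concurrent tasks is a design choice of this sketch rather than something the description specifies. The pool has no eviction at all, so an entry stays open until every task that opened it has closed it.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical pool, separate from the LRU cache, with explicit open/close and no eviction. */
class ExplicitSnapshotPool<T extends Closeable> {
    private static final class Entry<R> {
        final R resource;
        int leases; // number of tasks that currently have this snapshot open
        Entry(R resource) { this.resource = resource; }
    }

    private final Function<String, T> opener; // assumed factory that opens a snapshot by key
    private final Map<String, Entry<T>> entries = new HashMap<>();

    ExplicitSnapshotPool(Function<String, T> opener) { this.opener = opener; }

    /** Opens the snapshot if needed and records one more lease on it. */
    synchronized T open(String snapshotKey) {
        Entry<T> e = entries.get(snapshotKey);
        if (e == null) {
            e = new Entry<>(opener.apply(snapshotKey));
            entries.put(snapshotKey, e);
        }
        e.leases++;
        return e.resource;
    }

    /** Drops one lease; the resource is closed when the last lease is dropped. */
    synchronized void close(String snapshotKey) {
        Entry<T> e = entries.get(snapshotKey);
        if (e == null) {
            throw new IllegalStateException("not open: " + snapshotKey);
        }
        e.leases--;
        if (e.leases == 0) {
            entries.remove(snapshotKey);
            try {
                e.resource.close();
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        }
    }

    synchronized int openCount() { return entries.size(); }
}
```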

Issue Links

depends upon

HDDS-7914 [Snapshot] Block FS API access to deleted (non-active) snapshots (Resolved)

links to

GitHub Pull Request #4567

GitHub Pull Request #4568

People

Assignee: Siyao Meng

Reporter: George Jahad

Votes: 0

Watchers: 3

Dates

Created: 08/Feb/23 23:25

Updated: 03/May/23 20:54

Resolved: 03/May/23 20:54