[OAK-6254] DataStore: API to retrieve approximate storage size - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: blob
Labels: None

Description

The estimated size of the datastore (on disk) is needed to:

monitor growth over time, or the growth caused by certain operations
monitor whether garbage collection is effective
avoid running out of disk space
estimate backup size
gather statistics (for example, if there are many repositories, to group them by size)

Datastore size: we could use the following heuristic: read the file sizes in ./datastore/00/00 (if it exists) and multiply the total by 65536, or read those in ./datastore/00 and multiply by 256. That would give a rough estimate (within about 20% for repositories with a datastore size > 50 GB).
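The heuristic above could be sketched in Java roughly as follows. This is only an illustration, not Oak code: the class DataStoreSizeEstimator and its methods are hypothetical names, summing the sample directory recursively is an assumption about how files are nested below it, and -1 is used here as a stand-in for "unknown".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper (not part of Oak): extrapolates the datastore size
// from a single sample directory.
final class DataStoreSizeEstimator {

    // Sums the sizes of all regular files below dir (recursive; an assumption).
    static long sumFileSizes(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    // Returns the estimated size in bytes, or -1 if no sample directory exists.
    static long estimate(Path dataStoreRoot) throws IOException {
        Path twoLevels = dataStoreRoot.resolve("00").resolve("00");
        if (Files.isDirectory(twoLevels)) {
            // 00/00 is one of 256 * 256 = 65536 second-level directories.
            return sumFileSizes(twoLevels) * 65536L;
        }
        Path oneLevel = dataStoreRoot.resolve("00");
        if (Files.isDirectory(oneLevel)) {
            // 00 is one of 256 first-level directories.
            return sumFileSizes(oneLevel) * 256L;
        }
        return -1L;
    }
}
```

The extrapolation only works if files are spread evenly across the directories, which is why the estimate is expected to be rough for small datastores.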

I think this is mainly important for the FileDataStore. For the S3 datastore, if there is a simple and fast S3 API to read the size, then supporting it would be good as well; if there is none, then returning "unknown" is fine for me.

As for the API, I would use something like this: long getEstimatedStorageSize(int accuracyLevel), with accuracyLevel 1 for inaccurate (fastest), 2 for more accurate (slower), ..., 9 for precise (possibly very slow), similar to java.util.zip.Deflater.setLevel. I would expect it to take up to 1 second for accuracyLevel 1, up to 5 seconds for accuracyLevel 2, and possibly hours for level 9. With level 1, I would read the files in 00/00; with levels 2 to 8, the files in 00; and with level 9, all the files. For level 1, I wouldn't stop early; for level 2, if it takes more than 5 seconds, I would stop and return the current best estimate.
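The mapping from accuracy level to sampling strategy could look roughly like the sketch below. Only the method signature getEstimatedStorageSize(int accuracyLevel) comes from the proposal; the interface name StorageSizeEstimate, the class AccuracyPlan, its fields, and the UNKNOWN constant are hypothetical. The proposal gives a time limit only for level 2, so levels 3 to 8 are left without one here as an assumption.

```java
import java.time.Duration;

// Hypothetical interface carrying the proposed method; not an existing Oak type.
interface StorageSizeEstimate {
    long UNKNOWN = -1L; // assumed stand-in for "unknown"

    long getEstimatedStorageSize(int accuracyLevel);
}

// Hypothetical value class: what to read and how to scale it for a given level.
final class AccuracyPlan {
    final String sampleDir;   // relative to the datastore root; "" means everything
    final long scaleFactor;   // multiplier applied to the summed file sizes
    final Duration timeLimit; // null means no limit

    private AccuracyPlan(String sampleDir, long scaleFactor, Duration timeLimit) {
        this.sampleDir = sampleDir;
        this.scaleFactor = scaleFactor;
        this.timeLimit = timeLimit;
    }

    static AccuracyPlan forLevel(int accuracyLevel) {
        if (accuracyLevel < 1 || accuracyLevel > 9) {
            throw new IllegalArgumentException("accuracyLevel must be 1..9: " + accuracyLevel);
        }
        if (accuracyLevel == 1) {
            // Fastest: read 00/00 only and never stop early.
            return new AccuracyPlan("00/00", 65536L, null);
        }
        if (accuracyLevel <= 8) {
            // Read 00; level 2 stops after 5 seconds with the best estimate so far.
            Duration limit = (accuracyLevel == 2) ? Duration.ofSeconds(5) : null;
            return new AccuracyPlan("00", 256L, limit);
        }
        // Level 9: read every file, however long that takes.
        return new AccuracyPlan("", 1L, null);
    }
}
```

Keeping the level-to-strategy mapping separate from the file scanning would let a backend without a cheap size query, such as S3, simply return UNKNOWN for every level.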

Issue Links

is related to

OAK-7193 DataStore: API to retrieve statistic (file headers, size estimation)

Open

People

Assignee: Unassigned

Reporter: Thomas Mueller

Votes: 0

Watchers: 4

Dates

Created: 23/May/17 07:45

Updated: 10/Jan/20 09:51