This issue serves as a collection of all changes to the storage format introduced with Oak Segment Tar and their impact. Once sufficiently stabilised, this information should serve as the basis for the documentation in oak-doc.
|Generation in segment header||Required to unequivocally determine the generation of a segment during cleanup. Segment retention time is given in number of generations (2 by default).||No performance or space impact expected||offline||0.0.2|
|Stable id for node states||Required to efficiently determine equality of node states. This can be seen as an intermediate step to decoupling the address of records from their identity. The next step is to introduce logical record ids (
||Node states increase by the size of one record id (3 bytes / 20 bytes after
|Binary index in tar files||Avoid traversing the repository to collect the gc roots for DSGC. Fetch them from an index instead.||Additional index entry per tar file. Adds a couple of bytes per external binary to each tar file. Exact size to be determined. Francesco Mari, could you help with this?
|Simplified record ids||Preparation and precondition for logical record ids (
||Record ids grow from 3 bytes to 18 bytes when serialised into records. Impact on repositories is still to be assessed but can be anywhere from almost none to a factor of 6.
|Storage format versioning||In order to be able to further evolve the storage format with minimal impact on existing deployments, we need to carefully version the various storage entities (segments, tar files, etc.).||No performance or space impact expected||offline||0.0.2/ 0.0.10|
|Logical record ids||We need to separate addresses of records from their identity to be able to further scale the TarMK.
||A small size overhead per segment for the logical id table. Further impact to be evaluated (Francesco Mari, please add your assessment here).||offline||0.0.14 (planned)|
|External index for segments||Avoid recreating tar files if indexes are corrupt/missing. Just recreate the indexes.||Faster startup after a crash. Overall less disk space usage as no unnecessary backup files are created.||online||not yet planned||OAK-4649|
|In-place journal||Reduce complexity by in-lining the journal log. Fewer files, fewer chances to break something. Also, the granularity of the log would increase as flushing of the persisted head would not be required any more. Resilience would improve as the roll-back functionality could operate at a finer granularity.||No more journal.log. Better resiliency. Significant risk for regression of
||online||not yet planned||OAK-4103|
|Root record types||With the information currently available from the segment headers we cannot collect statistics about segment usage on repositories of non-trivial size. This fix would allow us to build more scalable tools in that respect.||None expected with respect to performance and size under normal operation.||offline||0.0.14 (planned) (waiting for
Misc ideas currently on the back burner:
- SegmentMK: Arch segments (OAK-1905)
- Extension headers for segments (no issue yet)
- More memory efficient serialisation of values (e.g. boolean) (no issue yet)
- Protocol Buffer for serialising records (no issue yet)
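
The "almost none to x6" range given for simplified record ids in the table above follows from the 3 byte to 18 byte growth alone. Below is a rough sketch of that arithmetic, not a measurement. The function name and the `record_id_fraction` parameter are hypothetical, and the model assumes that only serialised record ids change size while all other bytes stay the same.

```python
# Sizes taken from the table: a serialised record id grows from 3 to 18 bytes.
OLD_RECORD_ID_BYTES = 3
NEW_RECORD_ID_BYTES = 18


def estimated_growth_factor(record_id_fraction: float) -> float:
    """Estimate by how much record data grows.

    record_id_fraction is a hypothetical input: the share of the
    current record bytes that is taken up by serialised record ids.
    Assumption: everything that is not a record id keeps its size.
    """
    if not 0.0 <= record_id_fraction <= 1.0:
        raise ValueError("record_id_fraction must be between 0 and 1")
    per_id_growth = NEW_RECORD_ID_BYTES / OLD_RECORD_ID_BYTES
    unchanged = 1.0 - record_id_fraction
    return unchanged + record_id_fraction * per_id_growth


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, estimated_growth_factor(fraction))
```

Under this model a repository whose records hold no record ids does not grow, and one consisting of nothing but record ids grows sixfold. Real repositories fall somewhere in between, which is why the impact still needs to be assessed on actual data.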