[OAK-4833] Document storage format changes - ASF JIRA

Details

Type: Technical task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.6.0
Component/s: doc, segment-tar
Labels:
- documentation

Description

This issue serves as a collection of all changes to the storage format introduced with Oak Segment Tar and their impact. Once sufficiently stabilised, this information should serve as the basis for the documentation in oak-doc.

Change	Rationale	Impact	Migration	Since	Issues
Generation in segment header	Required to unequivocally determine the generation of a segment during cleanup. Segment retention time is given in number of generations (2 by default).	No performance or space impact expected.	offline	0.0.2	~~OAK-3348~~
Stable id for node states	Required to efficiently determine equality of node states. This can be seen as an intermediate step to decoupling the address of records from their identity. The next step is to introduce logical record ids (~~OAK-4659~~).	Node states increase by the size of one record id (3 bytes / 20 bytes after ~~OAK-4631~~). On top of that there is an additional block record à 18 bytes per node state.	offline	0.0.2	~~OAK-3348~~
Binary index in tar files	Avoid traversing the repository to collect the gc roots for DSGC. Fetch them from an index instead.	Additional index entry per tar file. Adds a couple of bytes per external binary to each tar file. Exact size to be determined. frm could you help with this? ~~OAK-4740~~ is a regression wrt. resiliency caused by this change (and the fact that the blob store might return blob ids longer than 2k chars).	offline	0.0.4	~~OAK-4101~~
Simplified record ids	Preparation and precondition for logical record ids (~~OAK-4659~~). At the same time the simplest possible fix for ~~OAK-2896~~. The latter leads to degeneration of segment sizes, which in turn has adverse effects on overall performance, resource utilisation and memory requirements. Without this fix ~~OAK-2498~~ would need to be fixed in a different way that would require other changes in the storage format. I started to regard this issue as removing a premature optimisation (which caused ~~OAK-2498~~). OTOH with ~~OAK-4844~~ we should also start looking into mitigations and what those would mean for size vs. simplicity vs. performance.	Record ids grow from 3 bytes to 18 bytes when serialised into records. Impact on repositories to be assessed but can be anywhere between almost none and x6. ~~OAK-4812~~ is a performance regression caused by this change. Its overall impact is yet to be assessed.	offline	0.0.10	~~OAK-4631~~, ~~OAK-4844~~
Storage format versioning	In order to be able to further evolve the storage format with minimal impact on existing deployments we need to carefully version the various storage entities (segments, tar files, etc.).	No performance or space impact expected.	offline	0.0.2/ 0.0.10	~~OAK-4232~~, ~~OAK-4683~~, ~~OAK-4295~~
Logical record ids	We need to separate addresses of records from their identity to be able to further scale the TarMK. ~~OAK-3348~~ (the online compaction misery) can be seen as a symptom of failing to understand this earlier. The stable ids introduced with ~~OAK-3348~~ are a first step in this direction. However this is not sufficient to implement features such as background compaction (~~OAK-4756~~), partial compaction (~~OAK-3349~~) or incremental compaction (~~OAK-3350~~).	A small size overhead per segment for the logical id table. Further impact to be evaluated (frm, please add your assessment here).	offline	0.0.14 (planned)	~~OAK-4659~~
External index for segments	Avoid recreating tar files if indexes are corrupt/missing. Just recreate the indexes.	Faster startup after a crash. Overall less disk space usage as no unnecessary backup files are created.	online	not yet planned	OAK-4649
In-place journal	Reduce complexity by in-lining the journal log. Fewer files, fewer chances to break something. Also the granularity of the log would increase as flushing of the persisted head would not be required any more. Resilience would improve as the roll-back functionality could operate at a finer granularity.	No more journal.log. Better resiliency. Significant risk for regression of ~~OAK-4291~~ if not implemented properly. Most likely a significant refactoring of some parts of the code is required before we can proceed with this issue.	online	not yet planned	OAK-4103
Root record types	With the information currently available from the segment headers we cannot collect statistics about segment usage on repositories of non-trivial sizes. This fix would allow us to build more scalable tools in that respect.	None expected wrt. performance and size under normal operation.	offline	0.0.14 (planned) (waiting for ~~OAK-4659~~ as implementation depends on how we progress there)	~~OAK-2498~~
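
Two of the quantitative statements in the table can be made concrete with a rough sketch. The class below is hypothetical and not part of Oak: the names `isReclaimable` and `growthFactor` are invented for illustration, the reclamation rule is an assumed reading of "retention time is given in number of generations", and the growth estimate assumes that only serialised record ids change size (3 bytes to 18 bytes) while all other bytes stay the same.

```java
// Hypothetical illustration, not Oak code: names and formulas are assumptions.
final class StorageFormatSketch {

    // Serialised record id sizes as stated in the table (before / after OAK-4631).
    static final int OLD_RECORD_ID_BYTES = 3;
    static final int NEW_RECORD_ID_BYTES = 18;

    private StorageFormatSketch() {
    }

    /**
     * Assumed cleanup rule: a segment becomes reclaimable once it is at least
     * retainedGenerations generations older than the current generation.
     */
    static boolean isReclaimable(int currentGeneration, int segmentGeneration, int retainedGenerations) {
        return currentGeneration - segmentGeneration >= retainedGenerations;
    }

    /**
     * Estimated size growth of a repository in which the given fraction
     * (0.0 to 1.0) of all bytes is taken up by serialised record ids.
     */
    static double growthFactor(double recordIdFraction) {
        if (recordIdFraction < 0.0 || recordIdFraction > 1.0) {
            throw new IllegalArgumentException("fraction must be between 0.0 and 1.0");
        }
        double ratio = (double) NEW_RECORD_ID_BYTES / OLD_RECORD_ID_BYTES; // 6.0
        return (1.0 - recordIdFraction) + recordIdFraction * ratio;
    }

    public static void main(String[] args) {
        // With the default of 2 retained generations and current generation 5:
        System.out.println(isReclaimable(5, 3, 2)); // true
        System.out.println(isReclaimable(5, 4, 2)); // false

        // No record ids at all: almost no growth; nothing but record ids: x6.
        System.out.println(growthFactor(0.0)); // 1.0
        System.out.println(growthFactor(1.0)); // 6.0
    }
}
```

The two extremes of `growthFactor` correspond to the "almost none to x6" range given for simplified record ids. Where a real repository falls in between depends on how much of its content consists of record ids, which is what still needs to be assessed.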

Misc ideas currently on the back burner:

SegmentMK: Arch segments (OAK-1905)
Extension headers for segments (no issue yet)
More memory efficient serialisation of values (e.g. boolean) (no issue yet)
Protocol Buffer for serialising records (no issue yet)

Attachments

OAK-4833-01.patch
19/Jan/17 10:23
7 kB
Francesco Mari