      This issue serves as a collection of all changes to the storage format introduced with Oak Segment Tar, together with their impact. Once sufficiently stabilised, this information should serve as the basis for the documentation in oak-doc.

      | Change | Rationale | Impact | Migration | Since | Issues |
      |---|---|---|---|---|---|
      | Generation in segment header | Required to unequivocally determine the generation of a segment during cleanup. Segment retention time is given in number of generations (2 by default). | No performance or space impact expected. | offline | 0.0.2 | OAK-3348 |
      | Stable id for node states | Required to efficiently determine equality of node states. This can be seen as an intermediate step towards decoupling the address of records from their identity. The next step is to introduce logical record ids (OAK-4659). | Node states increase by the size of one record id (3 bytes / 20 bytes after OAK-4631). On top of that there is an additional block record of 18 bytes per node state. | offline | 0.0.2 | OAK-3348 |
      | Binary index in tar files | Avoid traversing the repository to collect the gc roots for DSGC. Fetch them from an index instead. | Additional index entry per tar file. Adds a couple of bytes per external binary to each tar file. Exact size to be determined. Francesco Mari, could you help with this? OAK-4740 is a regression with respect to resiliency caused by this change (and by the fact that the blob store might return blob ids longer than 2k chars). | offline | 0.0.4 | OAK-4101 |
      | Simplified record ids | Preparation and precondition for logical record ids (OAK-4659). At the same time the simplest possible fix for OAK-2896. The latter leads to degeneration of segment sizes, which in turn has adverse effects on overall performance, resource utilisation and memory requirements. Without this fix OAK-2498 would need to be fixed in a different way that would require other changes in the storage format. I started to regard this issue as removing a premature optimisation (which caused OAK-2498). OTOH with OAK-4844 we should also start looking into mitigations and what those would mean for size vs. simplicity vs. performance. | Record ids grow from 3 bytes to 18 bytes when serialised into records. Impact on repositories to be assessed, but can be anywhere from almost none to a factor of 6. OAK-4812 is a performance regression caused by this change. Its overall impact is yet to be assessed. | offline | 0.0.10 | OAK-4631, OAK-4844 |
      | Storage format versioning | In order to be able to further evolve the storage format with minimal impact on existing deployments we need to carefully version the various storage entities (segments, tar files, etc.). | No performance or space impact expected. | offline | 0.0.2 / 0.0.10 | OAK-4232, OAK-4683, OAK-4295 |
      | Logical record ids | We need to separate addresses of records from their identity to be able to further scale the TarMK. OAK-3348 (the online compaction misery) can be seen as a symptom of failing to understand this earlier. The stable ids introduced with OAK-3348 are a first step in this direction. However, this is not sufficient to implement features such as background compaction (OAK-4756), partial compaction (OAK-3349) or incremental compaction (OAK-3350). | A small size overhead per segment for the logical id table. Further impact to be evaluated (Francesco Mari, please add your assessment here). | offline | 0.0.14 (planned) | OAK-4659 |
      | External index for segments | Avoid recreating tar files if indexes are corrupt/missing. Just recreate the indexes. | Faster startup after a crash. Overall less disk space usage as no unnecessary backup files are created. | online | not yet planned | OAK-4649 |
      | In-place journal | Reduce complexity by in-lining the journal log. Fewer files, fewer chances to break something. Also the granularity of the log would increase as flushing of the persisted head would not be required any more. Resilience would improve as the roll-back functionality could operate at a finer granularity. | No more journal.log. Better resiliency. Significant risk of a regression of OAK-4291 if not implemented properly. Most likely a significant refactoring of some parts of the code is required before we can proceed with this issue. | online | not yet planned | OAK-4103 |
      | Root record types | With the information currently available from the segment headers we cannot collect statistics about segment usage on repositories of non-trivial size. This fix would allow us to build more scalable tools in that respect. | None expected with respect to performance and size under normal operation. | offline | 0.0.14 (planned; waiting for OAK-4659, as the implementation depends on how we progress there) | OAK-2498 |
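
To illustrate the "Generation in segment header" entry above, here is a minimal sketch of how cleanup could decide which segments to keep once every segment header carries its generation. It is not the Oak implementation: the names `SegmentHeader`, `is_reclaimable` and `cleanup` are hypothetical, and the exact rule (keep the `retained` most recent generations, reclaim everything older) is an assumption derived only from "retention time is given in number of generations (2 by default)".

```python
from dataclasses import dataclass

# Default retention taken from the table above; everything else here is illustrative.
RETAINED_GENERATIONS = 2


@dataclass(frozen=True)
class SegmentHeader:
    """Hypothetical stand-in for the part of a segment header relevant to cleanup."""
    segment_id: str
    generation: int


def is_reclaimable(header: SegmentHeader, current_generation: int,
                   retained: int = RETAINED_GENERATIONS) -> bool:
    # Assumption: a segment survives while it belongs to one of the
    # `retained` most recent generations; anything older may be removed.
    return header.generation <= current_generation - retained


def cleanup(headers, current_generation: int,
            retained: int = RETAINED_GENERATIONS):
    """Return the headers of the segments that would be kept."""
    return [h for h in headers
            if not is_reclaimable(h, current_generation, retained)]


if __name__ == "__main__":
    segments = [SegmentHeader("a", 3), SegmentHeader("b", 4), SegmentHeader("c", 5)]
    for kept in cleanup(segments, current_generation=5):
        print(kept.segment_id, kept.generation)
```

Because the generation is read directly from the header, the decision needs no traversal of the repository content, which is the point of storing it there.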

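The "almost none to a factor of 6" range quoted for simplified record ids follows from the 3 byte to 18 byte growth of a serialised record id. The following back-of-the-envelope sketch makes the arithmetic explicit. The function `growth_factor` and its `id_fraction` parameter are hypothetical, and the model (a fixed fraction of the old bytes being record ids, everything else unchanged) is a simplifying assumption, not a measurement.

```python
OLD_RECORD_ID_BYTES = 3   # size of a serialised record id before OAK-4631
NEW_RECORD_ID_BYTES = 18  # size of a serialised record id after OAK-4631


def growth_factor(id_fraction: float) -> float:
    """Estimated size growth of data in which `id_fraction` of the old
    bytes were serialised record ids (assumed model, see lead-in)."""
    if not 0.0 <= id_fraction <= 1.0:
        raise ValueError("id_fraction must be between 0 and 1")
    per_id_growth = NEW_RECORD_ID_BYTES / OLD_RECORD_ID_BYTES
    # Bytes that are not record ids keep their size; record id bytes scale.
    return (1.0 - id_fraction) + id_fraction * per_id_growth


if __name__ == "__main__":
    for fraction in (0.0, 0.1, 0.5, 1.0):
        print(f"{fraction:.0%} record ids -> x{growth_factor(fraction):.2f}")
```

Under this model, content dominated by values (large strings, binaries) barely grows, while structures consisting almost entirely of record ids approach the upper bound of 6.
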
      Misc ideas currently on the back burner:

      • SegmentMK: Arch segments (OAK-1905)
      • Extension headers for segments (no issue yet)
      • More memory efficient serialisation of values (e.g. boolean) (no issue yet)
      • Protocol Buffer for serialising records (no issue yet)


