segment-tar, as the name suggests, stores the segments in a set of tar archives inside the segmentstore directory on the local file system. In some cases, especially in cloud deployments, it may be useful to store the segments outside the local FS: remote storage such as Amazon S3, Azure Blob Storage or HDFS may be cheaper than a mounted disk, more scalable, easier to provision, etc.
There are 3 classes responsible for handling tar files in segment-tar: TarFiles, TarWriter and TarReader. TarFiles manages the segmentstore directory, scans it for .tar files and creates a TarReader for each one. It also creates a single TarWriter object, used to write (and also read) the most recent tar file.
TarWriter appends segments to the latest tar file and also serializes the auxiliary indexes: the segment index, the binary references index and the segment graph. It also takes care of synchronization, as we're dealing with a mutable data structure - a tar file opened in append mode.
TarReader not only reads the segments from the tar file, but is also responsible for the revision GC (the mark & sweep methods) and for recovering data from files which haven't been closed cleanly (e.g. have no index).
In order to store segments somewhere other than in tar files, it would be possible to create separate implementations of TarFiles, TarWriter and TarReader. However, such implementations would duplicate a lot of code that is not strictly related to persistence - mark(), sweep(), synchronization, etc. Instead, the attached patch takes a different approach: a new layer of abstraction is injected into TarFiles, TarWriter and TarReader. It only takes care of segment persistence and knows nothing about synchronization, GC, etc., leaving those to the upper layer.
The new abstraction layer is modelled using 3 new classes: SegmentArchiveManager, SegmentArchiveReader and SegmentArchiveWriter. They are closely related to the existing Tar* classes and are used by them.
SegmentArchiveManager provides a set of file system-style methods, like open(), create(), delete(), exists(), etc. The open() and create() methods return instances of SegmentArchiveReader (SAReader) and SegmentArchiveWriter (SAWriter), respectively.
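To make the shape of this layer more concrete, here is a minimal sketch of how a manager and its reader/writer could fit together. Only the method names open(), create(), delete() and exists() come from the description above; all signatures, the Sketch* type names and the in-memory implementation are assumptions made for illustration and are not the interfaces from the patch.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical shapes for illustration; the signatures in the patch may differ.
interface SketchArchiveReader {
    // Returns the segment data, or null if the archive doesn't contain it.
    ByteBuffer readSegment(UUID id);
}

interface SketchArchiveWriter {
    void writeSegment(UUID id, byte[] data);
    ByteBuffer readSegment(UUID id);
    void close();
}

interface SketchArchiveManager {
    SketchArchiveReader open(String archiveName);   // null if the archive is missing
    SketchArchiveWriter create(String archiveName);
    boolean delete(String archiveName);
    boolean exists(String archiveName);
}

// In-memory stand-in for a tar directory or a remote bucket (assumption).
class InMemoryArchiveManager implements SketchArchiveManager {
    private final Map<String, Map<UUID, byte[]>> archives = new HashMap<>();

    @Override
    public SketchArchiveReader open(String archiveName) {
        Map<UUID, byte[]> segments = archives.get(archiveName);
        if (segments == null) {
            return null;
        }
        return id -> wrap(segments.get(id));
    }

    @Override
    public SketchArchiveWriter create(String archiveName) {
        Map<UUID, byte[]> segments = new HashMap<>();
        archives.put(archiveName, segments);
        return new SketchArchiveWriter() {
            @Override public void writeSegment(UUID id, byte[] data) { segments.put(id, data.clone()); }
            @Override public ByteBuffer readSegment(UUID id) { return wrap(segments.get(id)); }
            @Override public void close() { /* nothing to flush in memory */ }
        };
    }

    @Override
    public boolean delete(String archiveName) {
        return archives.remove(archiveName) != null;
    }

    @Override
    public boolean exists(String archiveName) {
        return archives.containsKey(archiveName);
    }

    private static ByteBuffer wrap(byte[] data) {
        return data == null ? null : ByteBuffer.wrap(data).asReadOnlyBuffer();
    }
}
```

In this sketch the upper layer only ever talks to the three interfaces, so swapping the in-memory map for a tar file or a cloud bucket would not touch the GC or synchronization code.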
SegmentArchiveReader, apart from reading segments, can also load and parse the index, graph and binary references. The logic responsible for parsing these structures has already been extracted, so it doesn't need to be duplicated in the SAReader implementations. Also, SAReader needs to be aware of the index, since it contains the segment offsets.
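The reason a reader needs the index can be illustrated with a small sketch: the index maps each segment id to its position inside the archive, and a read is just a lookup followed by a slice. The class names, the (offset, size) entry layout and the use of a byte array as the archive are assumptions for illustration only; the real index format is not described here.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical index entry: where a segment lives inside the archive.
final class SketchIndexEntry {
    final int offset;
    final int size;

    SketchIndexEntry(int offset, int size) {
        this.offset = offset;
        this.size = size;
    }
}

// Hypothetical reader that resolves segment ids through the index.
final class SketchIndexedReader {
    private final ByteBuffer archive;
    private final Map<UUID, SketchIndexEntry> index;

    SketchIndexedReader(byte[] archiveBytes, Map<UUID, SketchIndexEntry> index) {
        this.archive = ByteBuffer.wrap(archiveBytes).asReadOnlyBuffer();
        this.index = new HashMap<>(index);
    }

    boolean containsSegment(UUID id) {
        return index.containsKey(id);
    }

    // Returns a read-only view of the segment, or null if it isn't indexed.
    ByteBuffer readSegment(UUID id) {
        SketchIndexEntry entry = index.get(id);
        if (entry == null) {
            return null;
        }
        ByteBuffer view = archive.duplicate();   // independent position and limit
        view.position(entry.offset);
        view.limit(entry.offset + entry.size);
        return view.slice();
    }
}
```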
The SAWriter class allows writing and reading segments and also storing the indexes. It isn't thread-safe - it assumes that synchronization is already done in the higher layers.
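The split of responsibilities described here could look roughly like the following sketch: the persistence-level writer has no locking at all, and the upper layer serializes every call to it. Both class names and the use of a single lock object are assumptions for illustration; the patch may synchronize differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical persistence-level writer: deliberately NOT thread-safe.
class SketchUnsafeWriter {
    private final List<byte[]> segments = new ArrayList<>();

    void writeSegment(byte[] data) {
        segments.add(data.clone());
    }

    int segmentCount() {
        return segments.size();
    }
}

// Hypothetical upper layer (the role TarWriter plays): it owns the lock,
// so the delegate never sees concurrent calls.
class SketchSynchronizedTarWriter {
    private final SketchUnsafeWriter delegate = new SketchUnsafeWriter();
    private final Object lock = new Object();

    void writeEntry(byte[] data) {
        synchronized (lock) {
            delegate.writeSegment(data);
        }
    }

    int entryCount() {
        synchronized (lock) {
            return delegate.segmentCount();
        }
    }
}
```

Keeping the lock in one place means each new storage backend only has to implement plain reads and writes.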
In the patch, I've moved the tar implementation to the new classes: SegmentTarManager, SegmentTarReader and SegmentTarWriter.
Apart from the segments, the segmentstore directory also contains the following files:
All these files are supported by the new SegmentNodeStorePersistence. Usually there's a simple interface (RepositoryLock, JournalLogFile, etc.) for handling the files.
- The names and package locations of all the affected classes are subject to change - after applying the patch, TarFiles doesn't deal with the .tar files anymore; similarly, TarReader and TarWriter delegate the low-level file access duties to SegmentArchiveReader and SegmentArchiveWriter. I didn't want to change the names yet, to make it easier to understand the patch and to rebase it onto trunk changes.