Index: oak-doc/src/site/markdown/nodestore/segment/changes.md
===================================================================
--- oak-doc/src/site/markdown/nodestore/segment/changes.md (nonexistent)
+++ oak-doc/src/site/markdown/nodestore/segment/changes.md (working copy)
@@ -0,0 +1,85 @@
+
+
+# Changes in the data format
+
+This document describes the changes in the data format introduced by the Oak Segment Tar module.
+The purpose of this document is not only to enumerate such changes, but also to explain the rationale behind them.
+Each change links to the corresponding Jira issue, which provides a terser description.
+Changes are presented in chronological order.
+
+## Generation in segment headers
+
+* Jira issue: [OAK-3348](https://issues.apache.org/jira/browse/OAK-3348)
+* Since: Oak Segment Tar 0.0.2
+
+The GC algorithm implemented by Oak Segment Tar is based on the fundamental idea of grouping records into generations.
+When GC is performed, records belonging to older generations can be removed, while records belonging to newer generations have to be retained.
+
+The generation a record belongs to is not transient information: it has to persist across multiple restarts of the system.
+This means that the generation of a record has to be persisted together with the record.
+
+To avoid the size penalty of persisting additional information for each and every record, the generation is persisted only once, in the segment header.
+Thus, the generation of a record is defined as the generation of the segment containing that record.
+
+The original data format for the segment header contained some holes in the specification.
+The change made good use of one of those holes (bytes 10-13) to save the generation as a 4-byte integer value.
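+
+As a minimal sketch of the layout described above, the following Java snippet reads and writes the generation at bytes 10-13 of a segment header.
+The class name, the 32-byte buffer and the big-endian byte order are assumptions made for illustration; this is not the actual Oak Segment Tar code.
+
+```java
+import java.nio.ByteBuffer;
+
+public class SegmentGenerationSketch {
+
+    // Offset of the generation field: bytes 10-13 of the segment header.
+    static final int GENERATION_OFFSET = 10;
+
+    // Reads the generation as a 4-byte integer.
+    // ByteBuffer defaults to big-endian; that byte order is an assumption here.
+    static int readGeneration(byte[] segment) {
+        return ByteBuffer.wrap(segment).getInt(GENERATION_OFFSET);
+    }
+
+    // Writes the generation into bytes 10-13, leaving all other bytes untouched.
+    static void writeGeneration(byte[] segment, int generation) {
+        ByteBuffer.wrap(segment).putInt(GENERATION_OFFSET, generation);
+    }
+
+    public static void main(String[] args) {
+        byte[] header = new byte[32]; // hypothetical header size
+        writeGeneration(header, 7);
+        System.out.println(readGeneration(header));
+    }
+}
+```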
+
+## Stable identifiers
+
+* Jira issue: [OAK-3348](https://issues.apache.org/jira/browse/OAK-3348)
+* Since: Oak Segment Tar 0.0.2
+
+The fastest way to compare two node records is to compare their addresses.
+If their addresses are equal, the two node records are guaranteed to be equal.
+Transitively, the subtrees identified by those node records are guaranteed to be equal.
+
+The situation gets more complicated when the generation-based GC algorithm copies a node record to a new generation to save it from being deleted.
+In this situation, two copies of the same node record live in two different generations, in two different segments and at two different addresses.
+If you want to figure out if those two node records are the same, the trick of comparing their addresses will not work anymore.
+
+To overcome this problem, a stable identifier has been added to every node record.
+When a new node record is serialized, the address it is serialized to becomes its stable identifier.
+The stable identifier is included in the node record and becomes part of its serialized format.
+
+When the node record is copied to a new generation and a new segment, its address will inevitably change.
+The stable identifier, being part of the node record itself, will not change.
+This enables fast comparison between different copies of the same node record.
+Instead of comparing their addresses, you can compare their stable identifiers to achieve the same result.
+
+The stable identifier is serialized as an 18-byte string record.
+This record, in turn, is referenced from the node record through an additional 3-byte reference field.
+In conclusion, stable identifiers add an overhead of 21 bytes to every node record.
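+
+The idea can be illustrated with a deliberately simplified sketch.
+The `NodeRecord` class, its string-based addresses and its method names are hypothetical and only model the behaviour described above, not the actual serialized format.
+
+```java
+public class StableIdSketch {
+
+    // Overhead per node record: 18-byte string record plus 3-byte reference.
+    static final int OVERHEAD_BYTES = 18 + 3;
+
+    // Hypothetical, simplified node record with only the two relevant fields.
+    static final class NodeRecord {
+        final String address;  // where this copy is stored
+        final String stableId; // address of the first serialization
+
+        private NodeRecord(String address, String stableId) {
+            this.address = address;
+            this.stableId = stableId;
+        }
+
+        // First serialization: the address becomes the stable identifier.
+        static NodeRecord serialize(String address) {
+            return new NodeRecord(address, address);
+        }
+
+        // Copy into a new generation: new address, unchanged stable identifier.
+        NodeRecord copyTo(String newAddress) {
+            return new NodeRecord(newAddress, stableId);
+        }
+
+        // Two copies represent the same node if their stable identifiers match.
+        boolean sameNodeAs(NodeRecord other) {
+            return stableId.equals(other.stableId);
+        }
+    }
+}
+```
+
+In this sketch, a record copied by GC compares as different by address but as equal by stable identifier.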
+
+## Binary references index
+
+* Jira issue: [OAK-4201](https://issues.apache.org/jira/browse/OAK-4201)
+* Since: Oak Segment Tar 0.0.4
+
+The original data format in Oak Segment mandates that every segment maintains a list of references to external binaries.
+Every time a record references an external binary - i.e. a piece of binary data that is stored in a Blob Store - a new binary reference is added to its segment.
+The list of references to external binaries is inspected periodically by the Blob Store GC algorithm to know which binaries are currently in use.
+The Blob Store GC algorithm removes every binary that is not reported as used by the Segment Store.
+
+Retrieving the comprehensive list of external binaries for the whole repository is an expensive operation when it comes to I/O.
+Every segment in every TAR file has to be read into memory and its list of references to external binaries has to be parsed.
+Even if a segment does not contain references to external binaries, it has to be read into memory first for the system to figure that out.
+
+To make this process faster and less I/O-intensive, Oak Segment Tar introduces an index of references to external binaries in every TAR file.
+This index aggregates the required information from every segment contained in a TAR file.
+When Blob Store GC is performed, instead of reading and parsing every segment, you can read and parse the index files.
+This optimization may reduce the number of I/O operations by an order of magnitude in the best case.
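+
+The following sketch models the aggregation in memory.
+The `TarFileIndex` class, its map-based layout and the string identifiers are hypothetical and do not reflect the on-disk format of the actual index.
+
+```java
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.TreeSet;
+
+public class BinaryReferencesIndexSketch {
+
+    // Hypothetical model of the index of one TAR file:
+    // segment identifier -> references to external binaries.
+    static final class TarFileIndex {
+        private final Map<String, Set<String>> referencesBySegment = new HashMap<>();
+
+        // Called when a record in the given segment references an external binary.
+        void addReference(String segmentId, String binaryReference) {
+            referencesBySegment
+                .computeIfAbsent(segmentId, k -> new HashSet<>())
+                .add(binaryReference);
+        }
+
+        // All binary references of this TAR file, without touching any segment.
+        Set<String> allReferences() {
+            Set<String> result = new TreeSet<>();
+            for (Set<String> references : referencesBySegment.values()) {
+                result.addAll(references);
+            }
+            return result;
+        }
+    }
+
+    // What Blob Store GC needs: the union of the references of every index.
+    static Set<String> collectReferences(List<TarFileIndex> indexes) {
+        Set<String> result = new TreeSet<>();
+        for (TarFileIndex index : indexes) {
+            result.addAll(index.allReferences());
+        }
+        return result;
+    }
+}
+```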
\ No newline at end of file
Index: oak-doc/src/site/markdown/nodestore/segment/overview.md
===================================================================
--- oak-doc/src/site/markdown/nodestore/segment/overview.md (revision 1779425)
+++ oak-doc/src/site/markdown/nodestore/segment/overview.md (working copy)
@@ -50,6 +50,7 @@
* [Diff](#diff)
* [History](#history)
* [Design](#design)
+ * [Format changes](#format-changes)
## Garbage Collection
@@ -642,3 +643,9 @@
This website also contains an overview of the legacy implementation of the Segment Store and of the design decisions that brought to this implementation.
The page is old and describes a deprecated implementation, but can still be accessed [here](../segmentmk.html).
+
+### Format changes
+
+The Oak Segment Tar module introduces a number of changes in the data format compared to the legacy Oak Segment.
+The changes are described in greater detail [here](changes.html).
+Pointers to actual Jira issues can also be found on that page.