diff --git oak-doc/src/site/markdown/nodestore/overview.md oak-doc/src/site/markdown/nodestore/overview.md index 08788a5..eacd52c 100644 --- oak-doc/src/site/markdown/nodestore/overview.md +++ oak-doc/src/site/markdown/nodestore/overview.md @@ -17,9 +17,10 @@ # Node Storage -Oak comes with two node storage flavours: [Segment](segmentmk.html) and [Document](documentmk.html). -Segment storage is optimised for maximal performance in standalone deployments, -and document storage is optimised for maximal scalability in clustered deployments. +Oak comes with two node storage flavours: [Segment](segment/overview.html) and +[Document](documentmk.html). Segment storage is optimised for maximal +performance in standalone deployments, and document storage is optimised for +maximal scalability in clustered deployments. ## NodeStore API @@ -32,4 +33,4 @@ which is suited to work with in Java, and has lower performance and memory overh ## MicroKernel API The `MicroKernel` API was deprecated in OAK 1.2 and dropped from the project as of -Oak 1.3.0. It used to exposes its functionality through a String only (JSON) API. \ No newline at end of file +Oak 1.3.0. It used to exposes its functionality through a String only (JSON) API. diff --git oak-doc/src/site/markdown/nodestore/segment/overview.md oak-doc/src/site/markdown/nodestore/segment/overview.md new file mode 100644 index 0000000..3f33ea5 --- /dev/null +++ oak-doc/src/site/markdown/nodestore/segment/overview.md @@ -0,0 +1,39 @@ + + +# Segment Node Store + +The Segment Node Store is the implementation of the [Node +Store](../overview.html) that persists repository data on the file +system in an efficient, organized and performant way. + +One of the most important tasks of the Segment Node Store is the management of a +set of TAR files. These files are the most coarse-grained containers for the +repository data. You can learn more about the general structure of a TAR file +and how Oak leverages TAR files by reading [this page](tar.html). + +Every TAR file contains segments, finer-grained containers of repository data. +Unsurprisingly, segments inspired the name of this Node Store implementation. +Repository nodes and properties are serialized to one or more records, and these +records are saved into the segments. You can learn about the internal +organization of segments and the different ways to serialize records by reading +[this page](records.html). + +This website also contain a broader overview of the Segment Store and of the +design desictions that brought to his implementation. The page is quite old and +potentially outdated, but contains valuable information and is accessible +[here](../segmentmk.html). diff --git oak-doc/src/site/markdown/nodestore/segment/records.md oak-doc/src/site/markdown/nodestore/segment/records.md new file mode 100644 index 0000000..28e12ab --- /dev/null +++ oak-doc/src/site/markdown/nodestore/segment/records.md @@ -0,0 +1,291 @@ + + +# Records and segments + +While TAR files and segments are a coarse-grained mechanism to divide the +repository content in more manageable pieces, the real information is stored +inside the segments as finer-grained records. Here I zoom in the segments and +show the binary representation of data stored by Oak in the segments. It is not +strictly necessary to know how segments work in order to understand this +content, but if you feel lost you can refer to [this description of the +structure of TAR files](tar.html). + +## Data and bulk segments + +Segments are not created equal. Oak, in fact, distinguishes data and bulk +segments, where the former is used to store structured data (e.g. information +about node and properties), while the latter contains unstructured data (e.g. +the value of binary properties or of very long strings). + +It is possible to take apart a bulk segment from a data segment by just looking +at its identifier. As explained in a previous post, a segment identifier is a +randomly generated UUID. Segment identifiers are 16 bytes long, but Oak uses 4 +bits to store a flag capable to set apart bulk segments from data segments. + +The most interesting kind of segment is the data segment, because it stores +information about the repository in a structured and easily accessible way. + +## Overview of data segments + +A data segment can be roughly divided in two parts, a header and a data section. +The header contains management information about the segment itself, while the +data section stores the actual repository data. + +Repository data is split into records, that are tiny bits of information that +represent different types of information. There are different types of records, +where every type is specialized in storing a specific piece of information: node +records, template records, map records, list records, and so on. + +In general, a record can be considered as a contiguous sequence of bytes stored +at a specific position inside a segment. A record can also have references to +other records, where the referenced records can be stored in the same segment or +not. Since records can reference each other, a segment actually stores a graph +of records, where the implementation guarantees that the graph is acyclic. + +The segment also maintains a set of references to *root records* those records +in the graph that are not referenced by any other records. In graph jargon, +these records would be called source vertices. The set of references to root +records is stored in the header section of the segment. + +## Record identifiers + +Records need a mechanism to reference each other, both from inside the same +segment and across different segments. The mechanism used to reference a record +is (unsurprisingly) a record identifier. + +A record identifier is composed of a *segment field* and a *position field*. The +segment field is a single byte that identifies the segment where the referenced +record is stored. The position field is the position of the record inside the +segment identified by the segment field. There are some peculiarities in both +the segment and the position field that may not be immediately obvious. The +picture below shows how a segment looks like. + +![Overview of a segment](segment.png) + +The segment field is just one byte long, but a segment identifier is 16 bytes +long. To bridge the gap, the segment header contains an array of segment +identifiers that is used as a look-up table. The array can store only 255 +segment identifiers, so a single byte is enough to access every element in the +array. In fact, the segment field in a record identifier is just an index in the +array of segment identifiers that is used as a look-up table. The look-up table +always contains the segment identifier of the current segment in the first +position: if a segment field is set to zero, the referenced record is stored in +the current segment. + +The definition of the position field relies on some important properties of data +segments: + +- data segments have a maximum size of 256 KiB, or `0x40000` bytes. The size of + a data segment can never exceed this limit, but it is perfectly legal to have + data segments smaller than 256 KiB. + +- records are always aligned on a two-bit boundaries. Stated differently, when a + record is written in a segment, it must be stored at a position that is a + multiple of four. + +- records are stored from the end of the segment. Even if this may seem + counterintuitive, it makes perfectly sense if you consider that new records + are written as a consequence of an in-depth traversal of a content tree. + Writing records from the end of the segment guarantees that the records that + are relevant to the root of the content tree are at the beginning of the + segment, while records that are relevant to the leaves of the content tree are + stored at the end. This makes reading from the segment faster, because + operating systems are optimized to read files from the beginning to the end, + and not backwards. + +So, according to the the previous properties, allowed positions range from +`0x40000` (not included) to zero (included). Moreover, assigned positions must +be multiples of four. + +``` +0x3FFFC, 0x3FFF8, 0x3FFF4, 0x3FFF0, 0x3FFEC, ..., 0x0 +``` + +As you can see, three bytes would be necessary to store these positions, but we +know that a record identifier uses only two bytes to store position values. This +is possible because of a very simple optimization made to the positions before +being used in a record identifier. Since the positions are multiples of four, +the last two least significant bits are always zero. Being constant, these bits +can be removed by shifting the positions to the right twice. After the shift, +the list of possible positions become + +``` +0xFFFF, 0xFFFE, 0xFFFD, 0xFFFC, 0xFFFB, ..., 0x0 +``` + +With this optimization, only two bytes are necessary to store a position inside +a record identifier. Of course, when you read a position from a record +identifier you have to remember to shift the position to left twice to obtain a +valid position. + +The last important piece of information about positions is that they are always +assigned on a logic segment size of `0x40000` bytes, even if the segment ends up +to be smaller than 256 KiB. This doesn't mean that these absolute positions are +useless: in fact, they are converted to offsets relative to the effective end of +the segment. + +Hopefully an example will clarify this. + +Let's suppose that you are reading from a segment whose size is just 128 bytes. +Let's also suppose that you want to read a record from the position `0xFFF8`. +The problem is that the position `0xFFF8` was computed on a logical segment size +of 256 KiB (or `0x40000` bytes), so you have to convert the position `0xFFF8` +into an offset that can be used with the segment you are reading from. First of +all, the two least significant bits were stripped away from the position, so you +have to rotate `0xFFF8` two places to the left to obtain a proper position of +`0xFFF8 << 2 = 0x3FFE0`. How far was this position from the logical size of +`0x40000` bytes? It is easy to compute that the referenced record is `0x40000 - +0x3FFE0 = 0x20 = 32` bytes before the end of the segment. + +Given that the size of the current segment is only 128 bytes, you can use an +absolute position of `128 - 32 = 96` in the segment to read the record you are +interested in. + +## Record types + +As stated before, there are many types of records. It is necessary to make a +distinction between logical and physical records, where the former are an +idealized representation of a data structure, and the latter are used to encode +the data structures in the segments as sequence of bits. + +Usually there is a one-to-one mapping between logical and physical records, like +in block records, value records, template records and node records. Other types +of logical record, like map records and list records, use more than one physical +record to represent the content. + +Let's give a brief description of the aforementioned records. + +### Block records + +A block record is the simplest form of record, because it is just a plain +sequence of bytes. It doesn't even contain a length: it is up to the writer of +this record to store the length elsewhere. + +The only adjustment performed to the data is the alignment. The implementation +makes sure that the written sequence of bytes is stored at a position that is a +multiple of four. + +### Value records + +Value records are an improvement over block records, because they give the +possibility to store arbitrary binary data with an additional length and +optional references to other records. + +The implementation represents value records in different ways, depending on the +length of the data to be written. If the data is short enough, the record can be +written in the simplest way possible: a lento field and the data inlined +directly in the record. + +When the data is too big, instead, it is split into block records written into +block segments. The reference to these block records are stored into a list +record, whose identifier is stored inside the value record. + +This means that value record represent a good compromise when writing binary or +string data. If the data is short enough, it is written in such a way that can +be used straight away without further reads in the segment. If the data is too +long, instead, it is stored separated from the repository content not to impact +the performance of the readers of the segment. + +### List records + +List records are a general-purpose list of record identifiers. They are used as +building blocks for other types of records, as we saw for value records and as +we will see for template records and node records. + +The list record is a logical record using two different types of physical +records to represent itself: + +- bucket record: this is a recursive record representing a list of at most 255 + references. A bucket record can reference other bucket records, + hierarchically, or the record identifiers of the elements to be stored in the + list. A bucket record doesn't maintain any other information exception record + identifiers. + +- list record: this is a top-level record that maintains the size of the list in + an integer field and a record identifier pointing to a bucket. + +List records are useful to store a list of references to other records. If the +list is too big, it is split into different bucket records that may be stored +in the same segment or across segments. This guarantees good performance for +small lists, without loosing the capability to store lists with a big number of +elements. + +### Map records + +Map records are a general-purpose maps of strings to record identifiers. As +lists, they are used as building blocks for other types of records and are +represented using two types of physical record: + +- leaf record: if the number of elements in the map is small, they are all + stored in a leaf record. This covers the simplest case for small maps. + +- branch record: if the number of elements in the map is too big, the original + map is split into smaller maps based on a hash function applied to the keys of + the map. A branch record is recursive, because it can reference other branch + records if the sub-maps are too big and need to be split again. + +The implementation of the map record relies on the properties defined by an +external data structure called HAMT (Hash Array Mapped Trie), capable of +combining the properties of hash table and a trie. + +Map records are also optimized for small changes. In example, if only one +element of a previously stored map is modified, and the map is stored again, +only a "diff" of the map is stored. This prevents the full storage of the +modified map, which can save a considerable amount of space if the original map +was big. + +### Template records + +A template record stores metadata about nodes that, on average, don't change so +often. A template record stores information like the primary type, the mixin +types, the property names and the property types of a node. Having this +information stored away from the node itself prevents to write them over and +over again if they don't change when the node changes. + +In example, on average, a node is created with a certain primary type and, +optionally, with some mixin types. Usually, because of its primary type, a node +is already created with a set of initial properties. After that, only the value +of the properties change, but not the structure of the node itself. + +The template record allows Oak to handle simple modifications to nodes in the +most efficient way possible. + +### Node records + +The node record is the single most important type of record, capable of storing +both the data associated to the node and the structure of the content tree. + +A node record always maintain a reference to a template record. As stated +before, a template record defines the overall structure of the node, while the +variable part of it is maintained in the node record itself. + +The variable part of the node is represented by a list of property values and a +map of child nodes. + +The list of property values is implemented as a list of record identifiers. For +each property in the node, its value is written in the segment. The record +identifiers referencing the values of the properties are then packed together in +a list record. The identifier of the list record is stored as part of the node +record. If the value of some properties didn't change, the previous record +identifier is just reused. + +The map of child nodes is implemented as a map of record identifiers. For every +child node, its node record identifier is stored in a map indexed by name. The +map is persisted in a map record, and its identifier is stored in the node +record. Thanks to the optimizations implemented by the map record, small changes +to the map of children node don't create a lot of overhead in the segment. diff --git oak-doc/src/site/markdown/nodestore/segment/segment.png oak-doc/src/site/markdown/nodestore/segment/segment.png new file mode 100644 index 0000000..ae7b463 Binary files /dev/null and oak-doc/src/site/markdown/nodestore/segment/segment.png differ diff --git oak-doc/src/site/markdown/nodestore/segment/tar.md oak-doc/src/site/markdown/nodestore/segment/tar.md new file mode 100644 index 0000000..1060342 --- /dev/null +++ oak-doc/src/site/markdown/nodestore/segment/tar.md @@ -0,0 +1,210 @@ + + +# Structure of TAR files + +Here is described the phisical layout of a TAR file as used by Apache Oak. +First, a brief introduction of the TAR format is given. Next, more details are +provided about the low level information that are written in TAR entries. +Finally, it's described how Oak saves a graph data structure inside the TAR file +and how this representation is optimized for fast retrieval. + +## Organization of a TAR file + +Phisically speaking, a TAR file is a linear sequence of blocks. A TAR file is +terminated by two blocks containing zero bytes. Every block has a size of 512 +bytes. + +Logically speaking, a TAR file is a linear sequence of entries. Every entry is +represented by two or more blocks. The first block always contains the entry +header. Subsequent blocks store the content of the file. + +The entry header is composed of the following fields: + +- file name (100 bytes) - name of the file stored in this entry. + +- file mode (8 bytes) - string representation of the octal file mode. + +- owner's numeric ID (8 bytes) - string representation of the user ID of the + owner of the file. + +- group's numeric ID (8 bytes) - string representation of the group ID of the + owner of the file. + +- file size (12 bytes) - string representation of the octal size of the file. + +- last modification time (12 bytes) - string representation of the octal time + stamp when the file was last modified. + +- checksum (8 bytes) - checksum for the header data. + +- file type (1 byte) - type of the file stored in the entry. This field + specifies if the file is a regular file, a hard link or a symbolic link. + +- name of linked file (1 byte) - in case the file stored in the entry is a link, + this field stores the name of the file pointed to by the link. + +## The TAR file as used by Oak + +Some fields are not used by Oak. In particular, Oak sets the file mode, the +owner's numeric ID, the group's numeric ID, the checksum, and the name of linked +file to uninteresting values. The only meaningful values assigned to the fields +of the entry values are: + +- file name: the name of the data file. There are different data files used by + Oak. They are described below. + +- file size: the size of the data file. The value assigned to this field is + trivially computed from the amount of information stored in the data file. + +- last modification time: the time stamp when the entry was written. + +There are three kind of files stored in a TAR file: + +- segments: this type of file contains data about a segment in the segment + store. This kind of file has a file name in the form `UUID.CRC2`, where `UUID` + is a 128 bit UUID represented as an hexadecimal string and `CRC2` is a zer- + padded numeric string representing the CRC2 checksum of the raw segment data. + +- graph: this file has a name ending in `.gph` and contains a representaion of a + graph. The graph is represented as an adjacency list of UUIDs. + +- index: this file has a name ending in `.idx` and contains a sorted list of + every segment contained in the TAR file. + +The layout of the TAR file used by Oak is engineered for perfomance of read +operations. In particular, the most important information is stored in the +bottom entries. Reading the entries from the bottom of the file, you encounter +first the index, then the graph, then segment files. The idea is that the index +must be read first, because it provides a fast tool to locate segments in the +rest of the file. Next comes the graph, that describes how segments relate to +each other. Last come the segments, whose relative order can be ignored. + +At the same time, the layout of the TAR file allows fast append-only operations +when writing. Since the relative order of segment files is not important, +segment entries can be written in a first come, first served basis. The index at +the end of the file will provide a fast way to access them even if they are +scattered around the file. + +## Segment files + +Segment files contain raw data about a segment. Even if there are multiple kinds +of segments, TAR file only distinguishes between data and non-data segments. A +non-data segment is always saved as-is in the TAR file, without further +processing. A data segment, instead, is inspected to extract references to other +segments. + +A data segment can contain at most 255 references to other segments. These +references are simply stored as a list of UUIDs. The referenced segments can be +stored inside the current TAR file or outside of it. In the first case, the +referenced segment can be found by inspecting the index. In the second case, an +external agent is responsible to find the segment in another TAR file. + +The list of segments referenced by a data segment will end up in the graph file. +To speed up the process of locating a segment in the list of referenced segment, +this list is maintained ordered. + +## Graph files + +The graph file represents the relationships between segments stored inside or +outside the TAR file. The graph is stored as an adjacency list of UUID, where +each UUID represents a segment. + +The format of the graph file is optimized for reading. The graph file is stored +in reverse order to maintain the most important information at the end of the +file. This strategy is inline with the overall layout of the entries in the TAR +file. + +The content of the graph file is divided in three main parts: a graph header, a +graph adjacency list and a vertex mapping table. + +The graph header contains the following fields: + +- a magic number (4 bytes): identifies the beginning of a graph file. + +- size of the graph adjacency list (4 bytes): number of bytes occupied by the + graph adjacency list. + +- size of the vertex mapping table (4 bytes): number of bytes occupied by the + vertex mapping table. + +- checksum (4 bytes): a CRC2 checksum of the content of the graph file. + +Immediately after the graph header, the graph adjacency list is stored. In the +list, each vertex is represented by an integer. Each integer represents an index +in the vertex mapping table. For each vertex stored in the adjacency list, the +following information are written: + +- the integer representing the current vertex. + +- zero or more integers for each vertex referenced by the current one. + +- a sentinel value representing the list of adjacent vertices (-1). + +At the end, the vertex mapping table is stored. This table is just an ordered +list of UUIDs. The integers used in the graph adjacency list can be used as +index in the vertext mapping table to read the UUID of the corresponding +segment. This is a space optimization. Since UUIDs can be repeated more than +once in the adjacency list, it make sense to replace them with cheaper +placeholders. A UUID is 128 bit long, while an integer just 4. + +## Index files + +The index file is an ordered list of references to the entries contained in the +TAR file. The references are ordered by UUID and they point to the position in +the file where the entry is stored. Like the graph file, even the index file is +stored backwards. + +The index file is divided in two parts. The first is an index header, the second +contains the real data about the index. + +The index data contains the following fields: + +- a magic number (4 bytes): identifies the beginning of an index file. + +- size fo the index (4 bytes): number of bytes occupied by the index data. This + size also contains padding bytes that are added to the index to make it align + with the TAR block boundary. + +- number of entries (4 bytes): how many entries the index contains. + +- checksum (4 bytes): a CRC32 checksum of the content of the index file. + +After the header, the content of the index starts. For every entry contained in +the index, the following information are stored: + +- the most significat bits of the UUID (8 bytes) + +- the least significat bits of the UUID (8 bytes) + +- the offset in the TAR file where the TAR entry containing the segment is + located. + +- the size of the entry in the TAR file. + +Since the entries in the index are sorted by UUID, and since the UUIDs assigned +to the entries are uniformly distributed, when searching an entry by its UUID an +efficient algorithm called interpolation search can be used. This algorithm is a +variation of binary search. While in binary search the search space (in this +case, the array of entry) is halved at every iteration, interpolation search +exploits the distribution of the keys to remove a portion of the search space +that is potentially bigger than the half of it. Interpolation search is a more +natural approximation of the way a person searches in a phone book. If the name +to search begins with the letter T, in example, it makes no sense to open the +phone book at the half. It is way more efficient, instead, to open the phone +book close to the bottom quarter, since names starting with the letter T are +more likely to be distributed in that part of the phone book. diff --git oak-doc/src/site/site.xml oak-doc/src/site/site.xml index 971fdf2..71fab55 100644 --- oak-doc/src/site/site.xml +++ oak-doc/src/site/site.xml @@ -41,7 +41,7 @@ under the License. - +