Details
New Feature
Status: Closed
Major
Resolution: Later
Description
Object Container Files could use a 1-byte sync marker (set to zero), with zig-zag and COBS encoding used within blocks to efficiently escape zeros from the record data.
Zig-Zag encoding
With zig-zag encoding, only the value 0 (zero) is encoded as a single zero byte; in the canonical variable-length form, no non-zero value produces a zero byte anywhere in its encoding. This property means that we can write any non-zero zig-zag long inside a block without concern for creating an unintentional sync byte.
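To make the property above concrete, here is a minimal sketch of a zig-zag variable-length long codec. The function names `zigzag_encode` and `zigzag_decode` are illustrative and not taken from any existing library; the sketch assumes 64-bit signed values and a canonical (shortest) encoding.

```python
def zigzag_encode(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zig-zag variable-length long."""
    # Map signed to unsigned: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while z >= 0x80:
        # The continuation bit (0x80) keeps every non-final byte non-zero.
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    # The final byte is zero only when the whole value is zero.
    out.append(z)
    return bytes(out)


def zigzag_decode(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one zig-zag long starting at pos; return (value, next_pos)."""
    z = 0
    shift = 0
    while True:
        b = data[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    # Undo the zig-zag mapping.
    return (z >> 1) ^ -(z & 1), pos
```

With this sketch, `zigzag_encode(0)` is the single byte `0x00`, while any non-zero input yields bytes that are all non-zero, which is why the type and length fields can never be mistaken for a sync byte.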
COBS encoding
We'll use COBS encoding to ensure that all zeros are escaped inside the block payload. You can read http://www.sigcomm.org/sigcomm97/papers/p062.pdf for the details about COBS encoding.
Block Format
All blocks start and end with a sync byte (set to zero) with a type-length-value format internally as follows:
name | format | length in bytes | value | meaning |
---|---|---|---|---|
sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker for the start of a block |
type | zig-zag long | variable | must be non-zero | The type field expresses whether the block is for metadata or normal data. |
length | zig-zag long | variable | must be non-zero | The length field expresses the number of bytes until the next block (including the cobs code and sync byte). Useful for skipping ahead to the next block. |
cobs_code | byte | 1 | see COBS code table below | Used in escaping zeros from the block payload |
payload | cobs-encoded | Greater than or equal to zero | all non-zero bytes | The payload of the block |
sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker for the end of the block |
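As a rough illustration of the layout in the table above, the following sketch frames an already COBS-encoded payload into a block. Everything named here (`SYNC`, `zigzag_encode`, `frame_block`) is hypothetical. The sketch assumes that the length field counts the cobs code byte, the payload bytes, and the closing sync byte, which is one reading of the table rather than a confirmed detail of the proposal.

```python
SYNC = b"\x00"


def zigzag_encode(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zig-zag variable-length long."""
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def frame_block(block_type: int, cobs_bytes: bytes) -> bytes:
    """Wrap COBS-encoded bytes (cobs_code followed by payload) in a block.

    Assumption: length = cobs code + payload + closing sync byte.
    """
    if block_type == 0:
        raise ValueError("type must be non-zero")
    if not cobs_bytes or 0 in cobs_bytes:
        raise ValueError("expected a cobs code plus zero-free payload bytes")
    length = len(cobs_bytes) + 1  # +1 for the closing sync byte
    return (
        SYNC
        + zigzag_encode(block_type)
        + zigzag_encode(length)
        + cobs_bytes
        + SYNC
    )
```

Because the type and length are non-zero zig-zag longs and the COBS output contains no zeros, the only zero bytes in the framed block are the two sync bytes.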
COBS code table
Code | Followed by | Meaning |
---|---|---|
0x00 | (not applicable) | (not allowed) |
0x01 | nothing | Empty payload followed by the closing sync byte |
0x02 | one data byte | The single data byte, followed by the closing sync byte |
0x03 | two data bytes | The pair of data bytes, followed by the closing sync byte |
0x04 | three data bytes | The three data bytes, followed by the closing sync byte |
n | (n-1) data bytes | The (n-1) data bytes, followed by the closing sync byte |
0xFD | 252 data bytes | The 252 data bytes, followed by the closing sync byte |
0xFE | 253 data bytes | The 253 data bytes, followed by the closing sync byte |
0xFF | 254 data bytes | The 254 data bytes not followed by a zero. |
(taken from http://www.sigcomm.org/sigcomm97/papers/p062.pdf)
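For readers who want to see the codes in action, here is a minimal sketch of standard COBS as described in the cited paper. It is not necessarily the exact variant this proposal intends: in standard COBS every code below 0xFF implies a zero in the decoded data unless it is the last code of the encoded buffer, and additional code bytes appear inline for payloads that contain zeros or run past 254 bytes. The names `cobs_encode` and `cobs_decode` are illustrative.

```python
def cobs_encode(data: bytes) -> bytes:
    """Return a COBS encoding of data that contains no zero bytes."""
    out = bytearray([0])  # placeholder for the first code byte
    code_pos = 0          # index of the code byte currently being filled
    code = 1              # 1 + number of data bytes since the last code
    for b in data:
        if b == 0:
            # Close the current run; the zero itself is implied by the code.
            out[code_pos] = code
            code_pos = len(out)
            out.append(0)
            code = 1
        else:
            out.append(b)
            code += 1
            if code == 0xFF:
                # 254 data bytes with no zero: close the run with 0xFF.
                out[code_pos] = code
                code_pos = len(out)
                out.append(0)
                code = 1
    out[code_pos] = code
    return bytes(out)


def cobs_decode(enc: bytes) -> bytes:
    """Invert cobs_encode; raise ValueError on malformed input."""
    out = bytearray()
    i = 0
    while i < len(enc):
        code = enc[i]
        if code == 0:
            raise ValueError("zero byte inside COBS data")
        if i + code > len(enc):
            raise ValueError("truncated COBS data")
        out += enc[i + 1 : i + code]
        i += code
        # A code below 0xFF implies a zero, except at the very end.
        if code != 0xFF and i < len(enc):
            out.append(0)
    return bytes(out)
```

For example, the four bytes `11 22 00 33` encode to `03 11 22 02 33`, and the encoder keeps only the position of the pending code byte and a one-byte counter as state.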
Encoding
Only the block writer needs to perform byte-by-byte processing to encode the block. The overhead for COBS encoding is very small in terms of the in-memory state required.
Decoding
Block readers are not required to do as much byte-by-byte processing as a writer. The reader could (for example) find a metadata block by doing the following:
1. Search for a zero byte in the file, which marks the start of a block
2. Read and zig-zag decode the type of the block
3. If the block is normal data, read the length, seek ahead to the next block and go to step 2 again
4. If the block is a metadata block, COBS-decode the data
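The reader steps above could be sketched roughly as follows. This is only an illustration under several assumptions that the proposal does not pin down: the constant `METADATA_TYPE = 1` is hypothetical, the length field is taken to cover the cobs code, payload, and closing sync byte, and a zero found where a type is expected is treated as the next block's opening sync byte (so the sketch works whether or not adjacent blocks share a sync byte). The COBS decoder is the standard one from the cited paper.

```python
from typing import Optional

METADATA_TYPE = 1  # hypothetical type value for metadata blocks


def zigzag_decode(data: bytes, pos: int) -> tuple[int, int]:
    """Decode one zig-zag long starting at pos; return (value, next_pos)."""
    z, shift = 0, 0
    while True:
        b = data[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def cobs_decode(enc: bytes) -> bytes:
    """Standard COBS decoding of a zero-free buffer."""
    out = bytearray()
    i = 0
    while i < len(enc):
        code = enc[i]
        if code == 0 or i + code > len(enc):
            raise ValueError("malformed COBS data")
        out += enc[i + 1 : i + code]
        i += code
        if code != 0xFF and i < len(enc):
            out.append(0)
    return bytes(out)


def find_metadata_payload(buf: bytes, start: int = 0) -> Optional[bytes]:
    """Return the decoded payload of the first metadata block, if any."""
    pos = buf.find(b"\x00", start)  # find a sync byte
    if pos < 0:
        return None
    pos += 1
    while pos < len(buf):
        if buf[pos] == 0:
            # A type is never zero, so this is another sync byte: skip it.
            pos += 1
            continue
        block_type, pos = zigzag_decode(buf, pos)  # read the type
        length, pos = zigzag_decode(buf, pos)
        if block_type == METADATA_TYPE:
            # Decode the cobs code + payload, excluding the closing sync.
            return cobs_decode(buf[pos : pos + length - 1])
        pos += length  # normal data: seek past it without touching the bytes
    return None
```

Note that data blocks are skipped using only their type and length fields; the payload bytes are examined only for the metadata block that is actually decoded. The sketch does no bounds checking on a truncated final block.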