  1. Apache Avro
  2. AVRO-27

Consistent Overhead Byte Stuffing (COBS) encoded block format for Object Container Files

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: spec
    • Labels: None

    Description

      Object Container Files could use a 1-byte sync marker (set to zero), with zig-zag and COBS encoding used within blocks to efficiently escape zeros in the record data.

      Zig-Zag encoding

      With zig-zag encoding, only the value 0 (zero) is encoded as a single zero byte; no non-zero value produces a zero byte anywhere in its encoding. This property means that we can write any non-zero zig-zag long inside a block without concern for creating an unintentional sync byte.
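
      The property can be checked with a short sketch. It assumes the variable-length zig-zag layout Avro uses for longs (7 bits per byte, low group first, high bit as a continuation flag); the class and method names are illustrative only and do not come from the issue or its attachments.

```java
import java.io.ByteArrayOutputStream;

final class ZigZagSketch {
    /** Zig-zag maps signed longs to unsigned ones: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... */
    static long zigZag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    /** Encodes n as a zig-zag variable-length long. */
    static byte[] encodeLong(long n) {
        long z = zigZag(n);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            // Continuation bit is set, so this byte is at least 0x80 and never 0x00.
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        // Final byte holds the highest non-empty group; it is 0x00 only when n == 0.
        out.write((int) z);
        return out.toByteArray();
    }

    /** Returns true if any byte of the encoding equals the sync value 0x00. */
    static boolean containsZeroByte(byte[] bytes) {
        for (byte b : bytes) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }
}
```

      Under these assumptions, encodeLong(0) is the single byte 0x00, while containsZeroByte(encodeLong(n)) is false for every non-zero n, which is why the type and length fields are required to be non-zero.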

      COBS encoding

      We'll use COBS encoding to ensure that all zeros are escaped inside the block payload. You can read http://www.sigcomm.org/sigcomm97/papers/p062.pdf for the details about COBS encoding.

      Block Format

      All blocks start and end with a sync byte (set to zero) with a type-length-value format internally as follows:

      name      | format       | length in bytes               | value                    | meaning
      ----------+--------------+-------------------------------+--------------------------+--------
      sync      | byte         | 1                             | always 0 (zero)          | The sync byte serves as a clear marker for the start of a block
      type      | zig-zag long | variable                      | must be non-zero         | The type field expresses whether the block is for metadata or normal data.
      length    | zig-zag long | variable                      | must be non-zero         | The length field expresses the number of bytes until the next block (including the cobs code and sync byte). Useful for skipping ahead to the next block.
      cobs_code | byte         | 1                             | see COBS code table below | Used in escaping zeros from the block payload
      payload   | cobs-encoded | Greater than or equal to zero | all non-zero bytes       | The payload of the block
      sync      | byte         | 1                             | always 0 (zero)          | The sync byte serves as a clear marker for the end of the block
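
      As a rough illustration of the layout above, the following sketch frames a payload that has already been COBS-encoded. It rests on assumptions the issue does not spell out: the cobs_code field is taken to be the first byte of the COBS output, and the length is taken to count the COBS output plus the closing sync byte. The names BlockFramingSketch, writeLong and writeBlock are hypothetical.

```java
import java.io.ByteArrayOutputStream;

final class BlockFramingSketch {
    /** Writes n as a zig-zag variable-length long. */
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    /**
     * Frames one block: sync, type, length, COBS bytes, sync.
     * cobsEncoded is assumed to start with the cobs_code byte and to contain no zeros.
     */
    static byte[] writeBlock(long type, byte[] cobsEncoded) {
        if (type == 0) {
            throw new IllegalArgumentException("type must be non-zero");
        }
        if (cobsEncoded.length == 0) {
            throw new IllegalArgumentException("COBS output always has at least a code byte");
        }
        for (byte b : cobsEncoded) {
            if (b == 0) {
                throw new IllegalArgumentException("payload is not COBS-encoded");
            }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0);                              // opening sync byte
        writeLong(out, type);                      // non-zero, so no zero byte
        writeLong(out, cobsEncoded.length + 1L);   // assumed: COBS bytes plus closing sync
        out.write(cobsEncoded, 0, cobsEncoded.length);
        out.write(0);                              // closing sync byte
        return out.toByteArray();
    }
}
```

      With a hypothetical type value of 1 and the COBS bytes 02 11 02 22, this produces 00 02 0A 02 11 02 22 00: the only zero bytes are the two sync markers.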

      COBS code table

      Code | Followed by      | Meaning
      -----+------------------+--------
      0x00 | (not applicable) | (not allowed)
      0x01 | nothing          | Empty payload followed by the closing sync byte
      0x02 | one data byte    | The single data byte, followed by the closing sync byte
      0x03 | two data bytes   | The pair of data bytes, followed by the closing sync byte
      0x04 | three data bytes | The three data bytes, followed by the closing sync byte
      n    | (n-1) data bytes | The (n-1) data bytes, followed by the closing sync byte
      0xFD | 252 data bytes   | The 252 data bytes, followed by the closing sync byte
      0xFE | 253 data bytes   | The 253 data bytes, followed by the closing sync byte
      0xFF | 254 data bytes   | The 254 data bytes not followed by a zero.

      (taken from http://www.sigcomm.org/sigcomm97/papers/p062.pdf)

      Encoding

      Only the block writer needs to perform byte-by-byte processing to encode the block. The overhead for COBS encoding is very small in terms of the in-memory state required.
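
      A minimal sketch of such a writer-side encoder is shown here. It follows the classic COBS algorithm from the paper linked above rather than any of the attached codecs, and the class name CobsEncodeSketch is illustrative. The only state carried between bytes is the position of the pending code byte and the running code value.

```java
import java.util.Arrays;

final class CobsEncodeSketch {
    /** COBS-encodes data so that the result contains no 0x00 bytes. No sync byte is appended. */
    static byte[] encode(byte[] data) {
        // Worst case: one code byte per 254 data bytes, plus the final code byte.
        byte[] out = new byte[data.length + data.length / 254 + 1];
        int codeIndex = 0; // slot reserved for the code byte of the current run
        int write = 1;     // next free output position
        int code = 1;      // 1 + number of non-zero bytes in the current run

        for (byte b : data) {
            if (b != 0) {
                out[write++] = b;
                code++;
            }
            // A zero in the input, or a full run of 254 bytes, closes the current run.
            if (b == 0 || code == 0xFF) {
                out[codeIndex] = (byte) code;
                codeIndex = write++;
                code = 1;
            }
        }
        out[codeIndex] = (byte) code; // code byte for the final (possibly empty) run
        return Arrays.copyOf(out, write);
    }
}
```

      For example, the input 11 00 22 encodes to 02 11 02 22, and an empty input encodes to the single code byte 01.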

      Decoding

      Block readers are not required to do as much byte-by-byte processing as a writer. The reader could (for example) find a metadata block by doing the following:

      1. Search for a zero byte in the file, which marks the start of a block
      2. Read and zig-zag decode the type of the block
        • If the block is normal data, read the length, seek ahead to the next block, and go to step 2 again
        • If the block is a metadata block, COBS-decode the data
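
      The steps above can be sketched as follows. Several details are assumptions made for illustration, since the issue does not fix them: the metadata type value (TYPE_METADATA = 1), the length counting the COBS bytes plus the closing sync byte, and well-formed input with no truncated blocks. Because a type is never zero, the sketch treats any zero byte before a type as a sync byte, so it works whether adjacent blocks share one sync byte or each carry their own.

```java
import java.io.ByteArrayOutputStream;

final class BlockScanSketch {
    static final long TYPE_METADATA = 1; // hypothetical: the issue assigns no type values

    private final byte[] buf;
    private int pos;

    BlockScanSketch(byte[] buf) {
        this.buf = buf;
    }

    /** Reads a zig-zag variable-length long at the cursor. */
    private long readLong() {
        long z = 0;
        int shift = 0;
        int b;
        do {
            b = buf[pos++] & 0xFF;
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);
    }

    /** Classic COBS decode of in[start .. start+len). */
    static byte[] cobsDecode(byte[] in, int start, int len) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = start;
        int end = start + len;
        while (i < end) {
            int code = in[i++] & 0xFF;
            for (int k = 1; k < code && i < end; k++) {
                out.write(in[i++]);
            }
            if (code != 0xFF && i < end) {
                out.write(0); // the zero that the code byte stood in for
            }
        }
        return out.toByteArray();
    }

    /** Returns the decoded payload of the first metadata block, or null if there is none. */
    byte[] findMetadata() {
        // Step 1: search for a zero byte, which can only be a sync byte.
        while (pos < buf.length && buf[pos] != 0) {
            pos++;
        }
        while (true) {
            // Skip sync bytes; a zero here cannot be a type.
            while (pos < buf.length && buf[pos] == 0) {
                pos++;
            }
            if (pos >= buf.length) {
                return null;
            }
            long type = readLong();        // Step 2: zig-zag decode the type
            int length = (int) readLong(); // assumed: COBS bytes plus closing sync
            if (type == TYPE_METADATA) {
                return cobsDecode(buf, pos, length - 1);
            }
            pos += length - 1;             // data block: seek to its closing sync byte
        }
    }
}
```

      Note that only the metadata payload is touched byte by byte; data blocks are skipped using the length field alone.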

      Attachments

        1. COBSCodec.java
          8 kB
          Matt Massie
        2. COBSCodec2.java
          10 kB
          Scott Carey
        3. COBSPerfTest.java
          0.6 kB
          Scott Carey
        4. COLSCodec.java
          8 kB
          Scott Carey
        5. COLSCodec2.java
          8 kB
          Scott Carey
        6. COWSCodec.java
          9 kB
          Scott Carey
        7. COWSCodec2.java
          8 kB
          Scott Carey
        8. COWSCodec3.java
          8 kB
          Scott Carey

          People

            Assignee: Unassigned
            Reporter: Matt Massie (massie)