Apache Avro / AVRO-196

Add encoding for sparse records

Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: java
    • Labels: None

    Description

      If a large Avro record has many fields but is mostly empty, Avro currently still serializes every field, which adds significant overhead. We could support a sparse record format for this case: before each record, a bitmask is serialized indicating which fields are present. The encoding type could be specified as a new attribute in the avpr, e.g.

      {"type":"record", "name":"Test", "encoding":"sparse", "fields":....}
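
      To make the bitmask idea concrete, here is a minimal sketch in plain Java. It is not the implementation from the linked commit and does not use the Avro API: the class name SparseRecordSketch, the use of Long[] to stand in for a record's fields (null meaning absent), and the fixed 8-byte value encoding are all assumptions made for illustration. Real Avro would write longs as variable-length zig-zag integers.

      ```java
      import java.nio.ByteBuffer;

      // Hypothetical illustration only: a record is modelled as Long[] where null = field absent.
      public class SparseRecordSketch {

          // Writes ceil(n/8) bytes of presence bitmask, then only the non-null values.
          static byte[] encode(Long[] fields) {
              byte[] mask = new byte[(fields.length + 7) / 8];
              int present = 0;
              for (int i = 0; i < fields.length; i++) {
                  if (fields[i] != null) {
                      mask[i / 8] |= (byte) (1 << (i % 8)); // set bit i for a present field
                      present++;
                  }
              }
              ByteBuffer buf = ByteBuffer.allocate(mask.length + present * Long.BYTES);
              buf.put(mask);
              for (Long value : fields) {
                  if (value != null) {
                      buf.putLong(value); // fixed 8 bytes here; an assumption, not Avro's varint
                  }
              }
              return buf.array();
          }

          // Reads the bitmask first, then one value for every bit that is set.
          static Long[] decode(byte[] data, int fieldCount) {
              ByteBuffer buf = ByteBuffer.wrap(data);
              byte[] mask = new byte[(fieldCount + 7) / 8];
              buf.get(mask);
              Long[] fields = new Long[fieldCount];
              for (int i = 0; i < fieldCount; i++) {
                  if ((mask[i / 8] & (1 << (i % 8))) != 0) {
                      fields[i] = buf.getLong();
                  }
              }
              return fields;
          }
      }
      ```

      With this layout, absent fields cost one bit each rather than at least one byte each, so a 20-field record with two fields set takes 3 mask bytes plus the two values. The reader needs to know the field count (from the schema) to size the mask.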

      I've put an implementation of the idea on github:
      http://github.com/justinsb/avro/commit/7f6ad2532298127fcdd9f52ce90df21ff527f9d1

      In our case this substantially reduces the serialized size: we use Avro to serialize performance metrics, where most of the fields are usually empty.

      The alternative of using a Map isn't a good idea because it (1) serializes the names of the fields and (2) means we lose strong typing.

          People

            Assignee: Unassigned
            Reporter: Justin SB (justinsb)
            Votes: 3
            Watchers: 2
