Avro / AVRO-196

Add encoding for sparse records

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: java
    • Labels: None

      Description

      If we have a large record with many fields in Avro that is mostly empty, Avro will currently still serialize every field, leading to significant overhead. We could support a sparse record format for this case: before each record, a bitmask is serialized indicating which fields are present. We could specify the encoding type as a new attribute in the avpr, e.g.

      {"type":"record", "name":"Test", "encoding":"sparse", "fields":....}

      I've put an implementation of the idea on github:
      http://github.com/justinsb/avro/commit/7f6ad2532298127fcdd9f52ce90df21ff527f9d1

      This leads to a big reduction in serialized size in our case: we're using Avro to serialize performance metrics, where most of the fields are usually empty.

      The alternative of using a Map isn't a good idea because it (1) serializes the names of the fields and (2) means we lose strong typing.
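
      As a rough illustration of the proposal (not the patch linked above), here is a minimal sketch that writes a presence bitmask followed by only the non-empty field values. It uses plain java.util.BitSet and a byte stream rather than Avro's encoder API, and the record layout and values are hypothetical.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.BitSet;

/**
 * Minimal sketch of the proposed sparse encoding: a bitmask saying which
 * fields are present is written before the record, and only the present
 * fields are serialized after it. The field layout here is hypothetical.
 */
public class SparseRecordSketch {

  /** Encodes a record of nullable int fields, writing only the present ones. */
  static byte[] writeSparse(Integer[] fields) throws IOException {
    BitSet present = new BitSet(fields.length);
    for (int i = 0; i < fields.length; i++) {
      if (fields[i] != null) {
        present.set(i);
      }
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] mask = present.toByteArray();   // the presence bitmask
    out.write(mask.length);                // mask length, kept to one byte for simplicity
    out.write(mask);
    for (Integer value : fields) {
      if (value != null) {
        // For brevity a present int is written as 4 raw bytes; the real
        // encoding would use Avro's zig-zag varint instead.
        out.write(value >>> 24);
        out.write(value >>> 16);
        out.write(value >>> 8);
        out.write(value);
      }
    }
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    // A mostly-empty record: only the third of four fields carries a value.
    Integer[] record = {null, null, 42, null};
    System.out.println("encoded size = " + writeSparse(record).length + " bytes");
  }
}
{code}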

        Activity

        Philip Zeyliger added a comment -

        I'm thinking about ways to approach this without changing the serialized form so much. There are two ways to model this:

        • For fields f1 through f10, create records r1 through r10. Then create u=union(r1, ..., r10) and store list(u). The union costs you one byte per item (I think), so it's more expensive than the bitset.
        • You could also just implement that sparseness manually. You store "bytes" first, followed by list(int). When you do the deserialization in your application, you interpret the bitset manually (see the sketch after this comment). (BTW, check out Java's BitSet class.)

        In general, there are often tighter data structures for specific applications, and Avro is unlikely to support all of them. For example, if you want to store a map with complex keys (or even a sorted map), you have to store pairs and build the map in the application. If you're storing a time series, you might want to store only the deltas relative to the previous values and interpret them at run-time. I think having to do the application-specific logic in wrapper classes is pretty normal.
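
        A minimal sketch of the manual interpretation described above, assuming the application has already read a standard Avro record holding the bitset as bytes and the present values as a list of ints (the field count of 10 is hypothetical):

{code:java}
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/**
 * Sketch of the read side of the manual approach: the schema stores a bytes
 * bitset plus a list of ints, and the application expands them back into a
 * dense field array. The field count is hypothetical.
 */
public class SparseBitsetReader {

  static Integer[] expand(byte[] bitsetBytes, List<Integer> presentValues, int fieldCount) {
    BitSet present = BitSet.valueOf(bitsetBytes);
    Integer[] fields = new Integer[fieldCount];   // absent fields stay null
    int next = 0;
    for (int i = present.nextSetBit(0); i >= 0; i = present.nextSetBit(i + 1)) {
      fields[i] = presentValues.get(next++);      // the i-th field was present
    }
    return fields;
  }

  public static void main(String[] args) {
    // Bits 2 and 7 set: only fields 2 and 7 (0-indexed) were serialized.
    BitSet present = new BitSet();
    present.set(2);
    present.set(7);
    Integer[] fields = expand(present.toByteArray(), Arrays.asList(42, 99), 10);
    System.out.println("field 2 = " + fields[2] + ", field 7 = " + fields[7]);
  }
}
{code}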

        Justin SB added a comment -

        You're probably right that this is too big a change for Avro in its early stages. In my use case I was storing floats, but I've switched to storing ints instead, so an empty value now costs 7 extra bits instead of 31.

        Perhaps we should see what can be achieved through compression first (AVRO-135). I'd like to see a per-record compression option, and I'd also like to have empty values compress well. I think as long as we choose an algorithm where consecutive zeroes are highly compressed, compression would solve the issue here, while also being more generally applicable.
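
        On the compression point, a quick check with java.util.zip.Deflater (just one plausible codec, not necessarily what AVRO-135 ends up using) shows how well a mostly-zero buffer compresses:

{code:java}
import java.util.zip.Deflater;

/** Quick demonstration that long runs of zero bytes compress very well. */
public class ZeroRunCompression {
  public static void main(String[] args) {
    byte[] sparse = new byte[10000];   // a mostly-empty payload: all zeros...
    sparse[1234] = 42;                 // ...except for a single value somewhere

    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(sparse);
    deflater.finish();
    byte[] compressed = new byte[10000];
    int compressedSize = deflater.deflate(compressed);
    deflater.end();

    System.out.println(sparse.length + " bytes -> " + compressedSize + " bytes");
  }
}
{code}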

        Thiruvalluvan M. G. added a comment -

        I guess you are storing a sentinel value (say 0) to indicate that the value is absent. We have an idiom in Avro that an optional field of type T is represented as [null, T] (a union of null and T). If you go that way, each field that is absent will be encoded as null. Null itself does not take any space in the Avro format, but the union branch will take 1 byte. This approach has two advantages over yours:

        • This is general and can be used for fields of any type. The cost is one byte irrespective of the size of the field.
        • You don't need a sentinel value.

        The disadvantages of this are:

        • This is not quite as efficient as bit fields; it takes one byte per field.
        • The one-byte overhead applies irrespective of the presence or absence of the field.
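
        For reference, here is a sketch of that [null, T] idiom as a schema, parsed with org.apache.avro.Schema.Parser (available in current Avro releases); the record and field names are made up for illustration:

{code:java}
import org.apache.avro.Schema;

/**
 * The [null, T] optional-field idiom: each field is a union of null and its
 * type, so an absent value costs one byte for the union index and nothing
 * for the null itself. The record and field names are hypothetical.
 */
public class OptionalFieldIdiom {

  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Test\",\"fields\":["
    + "{\"name\":\"metric1\",\"type\":[\"null\",\"int\"],\"default\":null},"
    + "{\"name\":\"metric2\",\"type\":[\"null\",\"int\"],\"default\":null}"
    + "]}";

  public static void main(String[] args) {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    System.out.println(schema.toString(true));
  }
}
{code}
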
        Jeff Hammerbacher added a comment -

        Perhaps we should see what can be achieved through compression first (AVRO-135).

        I think compression is the way to go here. We are missing a strategy for RPC compression, but it seems that sparse records in data files will compress well.
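
        For the data-file case, here is a sketch of what that could look like once codec support along the lines of AVRO-135 is in place; it uses DataFileWriter and CodecFactory.deflateCodec from current Avro releases, and the schema and file name are made up for illustration:

{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

/** Writes mostly-empty records to a deflate-compressed Avro data file. */
public class CompressedMetricsFile {
  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Metrics\",\"fields\":["
      + "{\"name\":\"m1\",\"type\":[\"null\",\"int\"],\"default\":null},"
      + "{\"name\":\"m2\",\"type\":[\"null\",\"int\"],\"default\":null}"
      + "]}");

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
    writer.setCodec(CodecFactory.deflateCodec(6));   // per-block deflate compression
    writer.create(schema, new File("metrics.avro"));

    for (int i = 0; i < 1000; i++) {
      GenericRecord record = new GenericData.Record(schema);
      record.put("m1", null);                        // mostly-empty record
      record.put("m2", null);
      writer.append(record);
    }
    writer.close();
  }
}
{code}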


          People

          • Assignee: Unassigned
          • Reporter: Justin SB
          • Votes: 3
          • Watchers: 3
