Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: spec
    • Labels: None

      Description

      There are better ways to encode arrays of varints [1] that are faster to decode and more space-efficient than encoding each varint independently.

      Extending the idea to other types of variable-length data, like 'bytes' and 'string', you could encode the entries for an array block as an array of lengths followed by contiguous byte/UTF-8 data.

      [1] group varint encoding: slides 57-63 of http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/WSDM09-keynote.pdf
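
      To make the idea concrete, here is a minimal, hypothetical sketch of group varint encoding for one group of four ints (not Avro code; the method name is made up): a single tag byte carries four 2-bit lengths, followed by each value's significant bytes.

          // Hypothetical sketch of group varint encoding for one group of four ints.
          // The tag byte stores four 2-bit lengths (bytes per value, minus one),
          // followed by each value's bytes, least significant first, with leading
          // zero bytes dropped. Returns the next free position in 'out'.
          static int encodeGroup(int a, int b, int c, int d, byte[] out, int pos) {
            int tagPos = pos++;          // reserve the tag byte
            int tag = 0;
            int[] values = {a, b, c, d};
            for (int i = 0; i < 4; i++) {
              int v = values[i];
              int len = 1;
              out[pos++] = (byte) v;     // low byte is always written
              while ((v >>>= 8) != 0) {  // emit remaining non-zero bytes
                out[pos++] = (byte) v;
                len++;
              }
              tag |= (len - 1) << (i * 2);
            }
            out[tagPos] = (byte) tag;
            return pos;
          }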

        Activity

        Doug Cutting added a comment -

        Adding a new fundamental type or encoding is hard to do compatibly. Rather, I wonder whether this could be layered, as a library? One might automatically rewrite schemas and have a layer that transforms data structures accordingly? This could perhaps be done without copying data, via wrapping DatumWriter and DatumReader implementations.

        Also related is columnar compression in a data file. In this case, a data file is a sequence of records whose schema might be rewritten. For example, a file containing <string,long> pairs might be represented as a data file containing <int,string,long> records, where the int holds the number of characters shared with the previous string and the long holds the difference from the previous long. Schema properties could indicate which fields should be represented as differences. If random access is required, e.g., for MapReduce splitting, then the container (DataFileReader & DataFileWriter in Java) might have per-block callbacks.
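
        As a rough illustration of the layering idea (a sketch only — the class and method names below are made up, not an existing Avro API), such a library could sit behind the standard DatumWriter interface:

            import java.io.IOException;
            import org.apache.avro.Schema;
            import org.apache.avro.io.DatumWriter;
            import org.apache.avro.io.Encoder;

            // Hypothetical sketch: a DatumWriter that rewrites each datum into a
            // transformed representation (e.g. columnar or delta-encoded) before
            // delegating to an ordinary writer bound to the rewritten schema.
            public abstract class TransformingDatumWriter<A, B> implements DatumWriter<A> {
              private final DatumWriter<B> inner;

              protected TransformingDatumWriter(DatumWriter<B> inner) {
                this.inner = inner;
              }

              /** Rewrite the application-level schema into the on-disk schema. */
              protected abstract Schema transformSchema(Schema original);

              /** Rewrite one application-level datum into its transformed form. */
              protected abstract B transformDatum(A datum);

              public void setSchema(Schema schema) {
                inner.setSchema(transformSchema(schema));
              }

              public void write(A datum, Encoder out) throws IOException {
                inner.write(transformDatum(datum), out);
              }
            }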

        Stu Hood added a comment -

        > Adding a new fundamental type or encoding is hard to do compatibly.
        Agreed, but this particular optimization is only possible with Avro's support, and it opens up a lot of other interesting possibilities. For instance, in your prefix encoding example, encoding a block of <int,string,long> as a record array<int>, array<long>, array<string> might give a 3-6x increase in decode speed (based on the numbers suggested in the link).

        It is worth considering how the specification can evolve backward-compatibly as well: perhaps the next revision of the specification could require a magic 'spec revision' number to be present in all schemas, and would treat a schema that is missing the revision number as a legacy format. This would allow readers and writers to communicate across spec-revision boundaries by disabling optimizations/encodings that the other side does not support.

        > One might automatically rewrite schemas and have a layer that transforms datastructures accordingly?
        Yeah: there is probably room for a schema-translation layer above Avro for things like RLE / prefix encoding, but I think it is a separate area of focus.

        Doug Cutting added a comment -

        > encoding a block of <int,string,long> as a record array<int>, array<long>, array<string> might give a 3-6x increase in decode speed

        That's what I meant by a schema transformation. You're transforming [<int,string,long>] to <[int],[string],[long]>. This might be done automatically by a layer that implements DatumReader and DatumWriter. The actual schema of the data file would be <[int],[string],[long]>, but application code would treat it as [<int,string,long>].
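
        For concreteness, a sketch of the two schemas involved (record and field names here are invented for the example; Schema.Parser is used only for illustration):

            import org.apache.avro.Schema;

            // Illustration only: the schema the application sees (an array of
            // records) versus the schema actually written to the data file (a
            // record of parallel arrays). Names are made up for the example.
            public class SchemaTransformExample {
              static final Schema APPLICATION_SCHEMA = new Schema.Parser().parse(
                  "{\"type\":\"array\",\"items\":{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
                  + "{\"name\":\"i\",\"type\":\"int\"},"
                  + "{\"name\":\"s\",\"type\":\"string\"},"
                  + "{\"name\":\"l\",\"type\":\"long\"}]}}");

              static final Schema FILE_SCHEMA = new Schema.Parser().parse(
                  "{\"type\":\"record\",\"name\":\"Rows\",\"fields\":["
                  + "{\"name\":\"is\",\"type\":{\"type\":\"array\",\"items\":\"int\"}},"
                  + "{\"name\":\"ss\",\"type\":{\"type\":\"array\",\"items\":\"string\"}},"
                  + "{\"name\":\"ls\",\"type\":{\"type\":\"array\",\"items\":\"long\"}}]}");
            }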

        Stu Hood added a comment -

        > That's what I meant by a schema transformation.
        As far as I know, there is no way to transform a schema that will allow you to dodge Avro's varint encoding and do group varint encoding instead: that was where I was suggesting you would get the encoding/decoding speed benefits by using multiple arrays.

        Scott Carey added a comment -

        It's doing more than transforming [<int,string,long>] to <[int],[string],[long]>.

        First, it's changing how a sequence of varints is encoded, from independent zig-zag varints to group varint encoding. This is where the decode speed improvements happen.

        Then it's doing the transformation above, and then one more:

        [string] is currently an int length followed by a sequence of bytes of that length. This would transform [string] into a <[lengths], bytes> pair, with group varint encoding for the lengths.
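
        A sketch of that <[lengths], bytes> layout (illustration only, not Avro's wire format; plain varints stand in for the group encoding here, and the class name is made up):

            import java.io.ByteArrayOutputStream;
            import java.io.IOException;
            import java.nio.charset.Charset;

            // Sketch of encoding a block of strings as all lengths up front,
            // followed by the contiguous UTF-8 data. Plain varints are used for
            // the lengths here; group varint encoding would replace that loop.
            public class StringBlockSketch {
              static byte[] encode(String[] block) throws IOException {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[][] utf8 = new byte[block.length][];
                for (int i = 0; i < block.length; i++) {
                  utf8[i] = block[i].getBytes(Charset.forName("UTF-8"));
                  writeVarint(out, utf8[i].length);   // lengths first
                }
                for (byte[] bytes : utf8) {
                  out.write(bytes);                   // then contiguous string data
                }
                return out.toByteArray();
              }

              static void writeVarint(ByteArrayOutputStream out, int n) {
                while ((n & ~0x7f) != 0) {            // more than 7 bits remain
                  out.write((n & 0x7f) | 0x80);
                  n >>>= 7;
                }
                out.write(n);
              }
            }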

        I see this sort of thing as most useful for the file format and other block-based sequence-of-records formats – for the file format we could have a columnar format that transformed records and arrays this way. Unions would be a bit tricky to handle, however.

        For the core Avro spec, this is only useful for arrays and maps, and schema translation alone is not sufficient, IMO, without group varint encoding – which, even if we had consensus to add it to the current binary format, could not land until Avro 2.0.

        Perhaps at some point we add group varint encodings, along with some other features like bitmasks for nullable fields, to an 'extended' or 'columnar' binary format and leave the 'core' format as simple as possible.

        Bruce Mitchener added a comment -

        If something happens here, it would be nice to be able to bypass the varint encoding entirely by explicitly stating the width of the integers (int8, int16, int32, int64, or whatever). There are use cases where the varint encoding chews up a lot of time and the space savings aren't as important.
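
        For example (a sketch of the fixed-width alternative, not an existing Avro encoding): an int32 always occupies four bytes, so encode/decode is branch-free at the cost of never saving space.

            // Hypothetical fixed-width (little-endian) int32 encoding: no length
            // branches, always exactly four bytes per value.
            static int writeFixed32(int value, byte[] buf, int pos) {
              buf[pos]     = (byte) value;
              buf[pos + 1] = (byte) (value >>> 8);
              buf[pos + 2] = (byte) (value >>> 16);
              buf[pos + 3] = (byte) (value >>> 24);
              return pos + 4;
            }

            static int readFixed32(byte[] buf, int pos) {
              return (buf[pos] & 0xff)
                  | (buf[pos + 1] & 0xff) << 8
                  | (buf[pos + 2] & 0xff) << 16
                  | (buf[pos + 3] & 0xff) << 24;
            }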

        Stu Hood added a comment -

        I'm having trouble getting an implementation of group varint encoding to be significantly (or any) faster than Avro's varint encoding, so this ticket is probably invalid.

        On a side note though, I'm curious what I might be doing wrong in BinaryEncoder setup for it to be so expensive: see https://github.com/stuhood/gvi#readme and the benchmark code https://github.com/stuhood/gvi/blob/master/src/test/java/net/hoodidge/gvi/GroupVarIntBenchmark.java#L53

        Scott Carey added a comment - edited

        I'd expect bigger gains for longs and for encode than for decode.

        Avro's int decode runs at about 4 to 5 clock cycles per byte; trying to shave a clock cycle off that is hard, since every byte requires a conditional operation, an array write, a counter increment, and some masking and shifting.
        The group varint approach can cut that down, though.
        Longs are still at 8 or so cycles per byte and might have more room to gain than ints.
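
        For reference, an illustrative zig-zag varint decode loop (not Avro's actual implementation) showing the per-byte work being described:

            // Illustrative per-byte work in a zig-zag varint decode (not Avro's
            // exact code): each byte costs a read, a mask, a shift, an accumulate,
            // a position increment and a continuation test. A real decoder would
            // also advance and return its buffer position.
            static int readZigZagInt(byte[] buf, int pos) {
              int b = buf[pos++] & 0xff;
              int n = b & 0x7f;
              int shift = 7;
              while (b > 0x7f) {              // continuation bit set
                b = buf[pos++] & 0xff;
                n |= (b & 0x7f) << shift;
                shift += 7;
              }
              return (n >>> 1) ^ -(n & 1);    // undo the zig-zag mapping
            }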

        For your code, the big thing that is likely slowing you down is:

                    buff.put(b(m));
        

        buff.put is a lot slower than byte array assignment. All ByteBuffer access is slow at this level.
        Unfortunately, ByteBuffer is polymorphic, and thus 'put' is a virtual method. A couple of things to try: perhaps use ReadWriteHeapByteBuffer, which is final and so can be better optimized and is not polymorphic at your call site. Or unpack the byte buffer higher up and pass (byte[], position) around through your methods to get raw byte[] access, which is fastest.
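
        A sketch of that last suggestion (assuming an array-backed heap buffer; the method name is made up):

            // Sketch: unpack the ByteBuffer once, write through the raw backing
            // array, and restore the buffer position afterwards, instead of one
            // virtual ByteBuffer.put call per byte. Heap (array-backed) buffers only.
            static void writeGroupRaw(java.nio.ByteBuffer buff, byte tag, byte[] valueBytes, int valueLen) {
              byte[] array = buff.array();
              int pos = buff.arrayOffset() + buff.position();
              array[pos++] = tag;
              for (int i = 0; i < valueLen; i++) {
                array[pos++] = valueBytes[i];      // plain array stores, no virtual calls
              }
              buff.position(pos - buff.arrayOffset());
            }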

        You might also try using Avro's Perf.java test and adding a new read method to BinaryDecoder to test it there. The way the methods and buffers are set up in BinaryDecoder was tweaked heavily to get the JVM to inline and optimize the right methods in the right order; several equivalent rearrangements of the code in BinaryDecoder are slower. It would be easier to isolate the differences if both used the same general setup and buffer.

        Group varint encoding lends itself to specialized processor instructions such as SIMD/vector instructions, so it might benefit C/C++ more than Java.

        On the encode side, Avro isn't optimized well yet. See AVRO-753. We should be able to improve int and long encoding by about a factor of 2.5.

        Stu Hood added a comment -

        > Unfortunately, ByteBuffer is polymorphic and thus 'put' is a virtual method.
        I had not thought of the polymorphism in ByteBuffer: thanks for the tip! Since groups are of a known length, copying to and from a temporary byte[] of the max group length would probably work swimmingly.
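
        Something like the following sketch, building on the hypothetical encodeGroup above (1 tag byte plus at most 4 x 4 value bytes per group):

            // Sketch of the temporary-buffer idea: encode each group into a small
            // scratch array of the maximum group size, then copy it into the
            // ByteBuffer with a single bulk put. Reuses the hypothetical
            // encodeGroup from the sketch under the issue description.
            static final int MAX_GROUP_BYTES = 1 + 4 * 4;   // tag byte + four 4-byte ints

            static void putGroup(java.nio.ByteBuffer buff, byte[] scratch, int a, int b, int c, int d) {
              int len = encodeGroup(a, b, c, d, scratch, 0); // scratch.length >= MAX_GROUP_BYTES
              buff.put(scratch, 0, len);                     // one bulk copy per group
            }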

        Stu Hood added a comment -

        Closing this in favor of AVRO-806.

        Thanks again for your optimization help, Scott.


          People

          • Assignee: Unassigned
          • Reporter: Stu Hood
          • Votes: 0
          • Watchers: 2
