Avro
  1. Avro
  2. AVRO-383

Optimizing ResolvingDecoder for default values

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: java
    • Labels:
      None

      Description

      When the reader's and writer's schemas are records and the reader's schema has a field with default value and the writer's schema doesn't have the field, the ResolvingDecoder keeps the default value in a byte array. This byte array is in Json format. Moving this to Avro binary format improves performance.

      Apply the test patch and try "Perf -M". Then apply the patch and run it again. On my machine, the performance is three times the original.

      1. AVRO-383-new.patch
        2 kB
        Thiruvalluvan M. G.
      2. AVRO-383-new-test.patch
        1 kB
        Thiruvalluvan M. G.
      3. AVRO-383.patch
        11 kB
        Thiruvalluvan M. G.
      4. AVRO-383-test.patch
        2 kB
        Thiruvalluvan M. G.

        Activity

        Hide
        Thiruvalluvan M. G. added a comment -

        Committed revision 905516. Thanks Doug for reviewing.

        Show
        Thiruvalluvan M. G. added a comment - Committed revision 905516. Thanks Doug for reviewing.
        Hide
        Doug Cutting added a comment -

        +1. Thanks!

        Show
        Doug Cutting added a comment - +1. Thanks!
        Hide
        Thiruvalluvan M. G. added a comment -

        There a new unit-test that catches this corner-case.

        Show
        Thiruvalluvan M. G. added a comment - There a new unit-test that catches this corner-case.
        Hide
        Thiruvalluvan M. G. added a comment -

        The patch is incomplete:

        • The skip operation still uses the JSonDecoder. Probably, this problem will never be caught because the default values will never be skipped - the default value come into play when the reader wants it and the writer doesn't have it. Skip happens when the situation is the other way around - writer has it and reader doesn't want it. Nevertheless, the API allows skip of default values, we should fix it.
        • With the Json decoder gone, DefaultStartAction no longer needs the the field "root".

        The new patch fixes both the problems.

        Show
        Thiruvalluvan M. G. added a comment - The patch is incomplete: The skip operation still uses the JSonDecoder. Probably, this problem will never be caught because the default values will never be skipped - the default value come into play when the reader wants it and the writer doesn't have it. Skip happens when the situation is the other way around - writer has it and reader doesn't want it. Nevertheless, the API allows skip of default values, we should fix it. With the Json decoder gone, DefaultStartAction no longer needs the the field "root". The new patch fixes both the problems.
        Hide
        Thiruvalluvan M. G. added a comment -

        Committed revision 904596.

        I tried to refactor the code with a generic schema walker so that both GenericDatumReader and ResolvingGrammarGenerator can use. But it became much more complex than what we have today. So I settled for the logic and code duplication.

        Thanks Doug and Philip for your review and comments.

        > There is a proposal to implement GenericDatumReader using ResolvingDecoder.

        Do we yet have a Jira for this?

        I'm still working on benchmark to demonstrate the benefits. I file a JIRA possibly tomorrow.

        Show
        Thiruvalluvan M. G. added a comment - Committed revision 904596. I tried to refactor the code with a generic schema walker so that both GenericDatumReader and ResolvingGrammarGenerator can use. But it became much more complex than what we have today. So I settled for the logic and code duplication. Thanks Doug and Philip for your review and comments. > There is a proposal to implement GenericDatumReader using ResolvingDecoder. Do we yet have a Jira for this? I'm still working on benchmark to demonstrate the benefits. I file a JIRA possibly tomorrow.
        Hide
        Doug Cutting added a comment -

        I'm +0 with the logic duplication in the patch: it's unfortunate, since we'll have to update two places if either (a) this logic has common bugs, or (b) the spec changes. I suspect (a) is more likely than (b). But, Thiru, you've provided some reasons why sharing this logic might not be easy and I don't see an easy alternative.

        > There is a proposal to implement GenericDatumReader using ResolvingDecoder.

        Do we yet have a Jira for this?

        Show
        Doug Cutting added a comment - I'm +0 with the logic duplication in the patch: it's unfortunate, since we'll have to update two places if either (a) this logic has common bugs, or (b) the spec changes. I suspect (a) is more likely than (b). But, Thiru, you've provided some reasons why sharing this logic might not be easy and I don't see an easy alternative. > There is a proposal to implement GenericDatumReader using ResolvingDecoder. Do we yet have a Jira for this?
        Hide
        Thiruvalluvan M. G. added a comment -

        Quick question: is it possible to avoid ResolvingParserGenerator.java::encode() having a full switch on all AVRO types? Or is the reason that you can't use JsonDecoder here because this is required for JsonDecoder to work?

        It is relatively easy to convert the JsonNode into a byte array and then read it back using JsonDecoder. The main reason I didn't want to use JsonDecoder is because JsonDecoder decodes in a streaming mode, it does not read the whole Json objects into memory before decoding the contents. We can decode very large objects without consuming too much of memory, this way. The main drawback of this approach is that records should encode their fields in the same order as that of the schema. We left this limitation in the JsonDecoder because we didn't want to buffer things.

        But, with default values, you don't expect the that the authors of the schema to write the default values preserving its field order. Even if they do, if the schema file is passed through some Json tools, the order is not guaranteed to be preserved.

        Another way to implement ResolvingParserGenerator.java::encode() might be to call GenericData#defaultFieldValue() then use GenericDatumWriter to encode this as binary. This might be a bit slower, but it would use a lot less code, and this is done at schema-compilation time, so shouldn't be too performance critical.

        I guess you meant GenericDatumReader#defaultFieldValue(). I did consider using defaultFieldValue() function. In fact the ResolvingParserGenerator's encode() modeled after defaultFieldValue(). The reason I didn't do so was that I didn't want to create mess of dependencies. Presently, the dependency is Schema is at layer 0,

        { BinaryEncoder, BinaryDecoder}

        at the layer 1, the classes in io.parsing at layer 2, the rest of the "advanced" encoders and decoders of io package at layer 3. Everything else is above layer 4. Throwing generic package into the mix, will look a lot complicated.

        There is a proposal to implement GenericDatumReader using ResolvingDecoder. Using GenericDatumReader in ResolvingDecoder will then cause circular dependency.

        I also looked at refactoring defaultFieldValue() into some common place and use it both at GenericDatumReader and ResolvingDecoder. But defaultFieldValue calls virtual methods of GenericDatumReader giving hooks for the users to customize its behavior.

        Show
        Thiruvalluvan M. G. added a comment - Quick question: is it possible to avoid ResolvingParserGenerator.java::encode() having a full switch on all AVRO types? Or is the reason that you can't use JsonDecoder here because this is required for JsonDecoder to work? It is relatively easy to convert the JsonNode into a byte array and then read it back using JsonDecoder. The main reason I didn't want to use JsonDecoder is because JsonDecoder decodes in a streaming mode, it does not read the whole Json objects into memory before decoding the contents. We can decode very large objects without consuming too much of memory, this way. The main drawback of this approach is that records should encode their fields in the same order as that of the schema. We left this limitation in the JsonDecoder because we didn't want to buffer things. But, with default values, you don't expect the that the authors of the schema to write the default values preserving its field order. Even if they do, if the schema file is passed through some Json tools, the order is not guaranteed to be preserved. Another way to implement ResolvingParserGenerator.java::encode() might be to call GenericData#defaultFieldValue() then use GenericDatumWriter to encode this as binary. This might be a bit slower, but it would use a lot less code, and this is done at schema-compilation time, so shouldn't be too performance critical. I guess you meant GenericDatumReader#defaultFieldValue(). I did consider using defaultFieldValue() function. In fact the ResolvingParserGenerator's encode() modeled after defaultFieldValue(). The reason I didn't do so was that I didn't want to create mess of dependencies. Presently, the dependency is Schema is at layer 0, { BinaryEncoder, BinaryDecoder} at the layer 1, the classes in io.parsing at layer 2, the rest of the "advanced" encoders and decoders of io package at layer 3. Everything else is above layer 4. Throwing generic package into the mix, will look a lot complicated. There is a proposal to implement GenericDatumReader using ResolvingDecoder. Using GenericDatumReader in ResolvingDecoder will then cause circular dependency. I also looked at refactoring defaultFieldValue() into some common place and use it both at GenericDatumReader and ResolvingDecoder. But defaultFieldValue calls virtual methods of GenericDatumReader giving hooks for the users to customize its behavior.
        Hide
        Doug Cutting added a comment -

        Another way to implement ResolvingParserGenerator.java::encode() might be to call GenericData#defaultFieldValue() then use GenericDatumWriter to encode this as binary. This might be a bit slower, but it would use a lot less code, and this is done at schema-compilation time, so shouldn't be too performance critical.

        Show
        Doug Cutting added a comment - Another way to implement ResolvingParserGenerator.java::encode() might be to call GenericData#defaultFieldValue() then use GenericDatumWriter to encode this as binary. This might be a bit slower, but it would use a lot less code, and this is done at schema-compilation time, so shouldn't be too performance critical.
        Hide
        Philip Zeyliger added a comment -

        Quick question: is it possible to avoid ResolvingParserGenerator.java::encode() having a full switch on all AVRO types? Or is the reason that you can't use JsonDecoder here because this is required for JsonDecoder to work?

        Show
        Philip Zeyliger added a comment - Quick question: is it possible to avoid ResolvingParserGenerator.java::encode() having a full switch on all AVRO types? Or is the reason that you can't use JsonDecoder here because this is required for JsonDecoder to work?

          People

          • Assignee:
            Thiruvalluvan M. G.
            Reporter:
            Thiruvalluvan M. G.
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development