[AVRO-2274] Improve resolving performance when schemas don't change - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: java
Labels: None

Description

This issue adds decoding optimizations based on the observation that schemas don't change very much. We add special-case paths to optimize the case where a subschema of the reader and the corresponding subschema of the writer are the same. The specific cases are:

For an enumeration, if the reader and writer schemas are the same, we can simply return the tag written by the writer rather than "adjust" it as if the symbols might have been reordered. In fact, we can return the writer's tag directly as long as the reader schema is an "extension" of the writer's: it may have added new symbols, but it hasn't renumbered any of the writer's symbols. Leaving an enumeration unchanged or "extending" it in this sense are the common ways enumerations evolve. (Our tests show this optimization improves performance by about 3%.)
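The "extension" condition can be sketched as a prefix check on the symbol lists. This is a minimal illustration, not the code from the patch: the class `EnumExtensionCheck` and its methods `isExtension`, `adjust`, and `resolveTag` are hypothetical names, and enum schemas are modeled as plain lists of symbol names instead of Avro `Schema` objects.

```java
import java.util.List;

// Hypothetical helper, for illustration only; not part of the Avro API.
final class EnumExtensionCheck {
    private EnumExtensionCheck() {}

    /** True when every writer symbol sits at the same ordinal in the reader's list. */
    static boolean isExtension(List<String> writerSymbols, List<String> readerSymbols) {
        if (readerSymbols.size() < writerSymbols.size()) {
            return false; // the reader dropped symbols, so some writer tag has no slot
        }
        for (int i = 0; i < writerSymbols.size(); i++) {
            if (!writerSymbols.get(i).equals(readerSymbols.get(i))) {
                return false; // a writer symbol was renumbered or renamed
            }
        }
        return true;
    }

    /** Slow path: translate the writer's ordinal to the reader's ordinal by symbol name. */
    static int adjust(int writerTag, List<String> writerSymbols, List<String> readerSymbols) {
        String symbol = writerSymbols.get(writerTag);
        int readerTag = readerSymbols.indexOf(symbol);
        if (readerTag < 0) {
            throw new IllegalArgumentException("No match for symbol " + symbol);
        }
        return readerTag;
    }

    /** Returns the writer's tag untouched on the fast path, the adjusted tag otherwise. */
    static int resolveTag(int writerTag, List<String> writerSymbols, List<String> readerSymbols) {
        return isExtension(writerSymbols, readerSymbols)
                ? writerTag
                : adjust(writerTag, writerSymbols, readerSymbols);
    }
}
```

In a real decoder the `isExtension` check would presumably run once when the schemas are resolved, so that decoding each value pays nothing for it; the sketch recomputes it per call only to stay short.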

When the reader and writer subschemas are both unions, resolution is expensive: we have an outer union preceded by a "writer-union action", but each branch of this outer union consists of union-adjust actions, which are heavyweight. We optimize the case where the reader and writer unions are the same: we fall back on the standard grammar used for a union, avoiding all these adjustments. Since unions are commonly used to encode "nullable" fields in Avro, and nullability rarely changes as a schema evolves, this optimization should help many users. (Our tests show this optimization improves performance by 25-30%, a significant win.)
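The idea of deciding once, at resolution time, whether a union needs per-branch adjustment can be sketched as follows. This is a simplified model under stated assumptions, not Avro's resolving grammar: `UnionPlan` is a hypothetical class, union branches are modeled as type-name strings, and branches are matched by name only, whereas real resolution also handles type promotion and nested schemas.

```java
import java.util.List;

// Hypothetical model of a resolved union; not part of the Avro API.
final class UnionPlan {
    /** null means the unions are identical and the writer's branch index is used as is. */
    private final int[] branchMap;

    private UnionPlan(int[] branchMap) {
        this.branchMap = branchMap;
    }

    /** Runs once per schema pair: picks the fast path or builds a branch translation table. */
    static UnionPlan resolve(List<String> writerBranches, List<String> readerBranches) {
        if (writerBranches.equals(readerBranches)) {
            return new UnionPlan(null); // same union: no adjustment needed
        }
        int[] map = new int[writerBranches.size()];
        for (int i = 0; i < map.length; i++) {
            // indexOf yields -1 when the reader has no branch with this name
            map[i] = readerBranches.indexOf(writerBranches.get(i));
        }
        return new UnionPlan(map);
    }

    boolean isFastPath() {
        return branchMap == null;
    }

    /** Runs once per decoded value: maps the writer's branch index to the reader's. */
    int readerBranch(int writerBranch) {
        if (branchMap == null) {
            return writerBranch;
        }
        int readerIndex = branchMap[writerBranch];
        if (readerIndex < 0) {
            throw new IllegalStateException("Writer branch " + writerBranch + " has no reader match");
        }
        return readerIndex;
    }
}
```

For the common nullable-field union `["null", "string"]` on both sides, `resolve` takes the first branch and every later `readerBranch` call is a single null check.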

The "custom code" generated for reading records has to read fields in a loop that uses a switch statement to deal with writers that may have reordered fields. In most cases, however, fields have not been reordered (especially in more complex records with many record subschemas). So we've added a new method to ResolvingDecoder called readFieldOrderIfDiff, a variant of the existing readFieldOrder. If the field order has indeed changed, readFieldOrderIfDiff returns the new field order, just as readFieldOrder does. If the field order hasn't changed, readFieldOrderIfDiff returns null. We then modified the generation of custom decoders for records to add a special-case path that simply reads the record's fields in order, without incurring the overhead of the loop or the switch statement. (Our tests show this optimization improves performance by 8-9%, on top of the 35-40% produced by the original custom-coder optimization.)
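The shape of the two decoding paths can be sketched without the Avro library. Everything here is a stand-in chosen for illustration: `fieldOrderIfDiff` mimics the null-returning contract described above but works on an `int[]` of reader field positions, the record `User` and its fields are invented, and the "wire" is a plain `String[]` in writer order. The actual generated code reads from a `ResolvingDecoder` instead.

```java
// Hypothetical stand-in for a generated custom decoder; not actual Avro output.
final class FieldOrderFastPath {
    private FieldOrderFastPath() {}

    /** Invented three-field record used only for this sketch. */
    static final class User {
        String name;
        String email;
        String city;
    }

    /**
     * Mimics the described contract: returns null when the writer's fields arrive
     * in the reader's order (positions 0, 1, 2, ...), otherwise returns the order.
     */
    static int[] fieldOrderIfDiff(int[] readerPositionsInWriterOrder) {
        for (int i = 0; i < readerPositionsInWriterOrder.length; i++) {
            if (readerPositionsInWriterOrder[i] != i) {
                return readerPositionsInWriterOrder.clone();
            }
        }
        return null;
    }

    /** Decodes one record whose values appear in writer order in {@code wire}. */
    static User decode(String[] wire, int[] readerPositionsInWriterOrder) {
        User user = new User();
        int[] order = fieldOrderIfDiff(readerPositionsInWriterOrder);
        if (order == null) {
            // Fast path: field order unchanged, so read straight through.
            user.name = wire[0];
            user.email = wire[1];
            user.city = wire[2];
        } else {
            // Slow path: loop plus switch to place each value in the right field.
            for (int i = 0; i < order.length; i++) {
                switch (order[i]) {
                    case 0: user.name = wire[i]; break;
                    case 1: user.email = wire[i]; break;
                    case 2: user.city = wire[i]; break;
                    default:
                        throw new IllegalStateException("Unknown field position " + order[i]);
                }
            }
        }
        return user;
    }
}
```

Both paths produce the same record; the fast path just skips the per-field dispatch, which is where the reported saving would come from.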

Issue Links

links to GitHub Pull Request #393

People

Assignee: Raymie Stata

Reporter: Raymie Stata

Votes: 0

Watchers: 4

Dates

Created: 26/Nov/18 00:45

Updated: 28/Nov/18 03:34

Resolved: 28/Nov/18 03:34