[NIFI-5640] Improve efficiency of Avro Record Reader - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.8.0
Component/s: None
Labels: None

Description

Several things we do in the Avro Reader cause subpar performance. First, when AvroTypeUtil converts an Avro GenericRecord to our Record, building the RecordSchema is slow because we call toString() on the Avro schema, which is quite expensive, in order to provide a textual version of the schema to RecordSchema. However, the schema text is optional and typically goes unused, so we should avoid calling Schema#toString() whenever possible.
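One way to avoid paying for the schema text up front is to defer the toString() call until a caller actually asks for the text. The following is a minimal sketch of that idea only; `LazyTextSchema` and `ExpensiveSchema` are hypothetical stand-ins invented for illustration, not NiFi's RecordSchema or Avro's Schema, and the change made for this issue may take a different approach.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical schema holder that defers producing its text form. */
final class LazyTextSchema {
    private final Supplier<String> textSupplier; // null when no text is available
    private String cachedText;                   // filled in on first request

    LazyTextSchema(final Supplier<String> textSupplier) {
        this.textSupplier = textSupplier;
    }

    /** Computes the text only when a caller asks for it, then reuses it. */
    synchronized Optional<String> getSchemaText() {
        if (cachedText == null && textSupplier != null) {
            cachedText = textSupplier.get();
        }
        return Optional.ofNullable(cachedText);
    }
}

/** Hypothetical stand-in for an object whose toString() is expensive. */
final class ExpensiveSchema {
    static final AtomicInteger TO_STRING_CALLS = new AtomicInteger();

    @Override
    public String toString() {
        TO_STRING_CALLS.incrementAndGet(); // count calls so the laziness is observable
        return "{\"type\":\"record\",\"name\":\"example\"}";
    }
}
```

Constructing `new LazyTextSchema(avroSchema::toString)` costs nothing until `getSchemaText()` is called, so records whose schema text is never read never trigger the expensive call.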

The AvroTypeUtil class also calls #getNonNullSubSchemas() frequently. In some cases we do not need the result and can avoid creating the sublist. In the cases where we do need it, the method calls stream() on an existing List just to filter out 0 or 1 elements. Although stream() makes the code very readable, it is considerably more expensive than iterating over the existing list and adding the elements to an ArrayList. We should avoid stream() for trivial pieces of code in time-critical parts of the codebase.
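To make the comparison concrete, here is a small sketch of the two styles of filtering. The `Kind` enum and both method names are hypothetical stand-ins for Avro schema types and for the real #getNonNullSubSchemas() method; only the shape of the loop is the point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class NonNullFilterSketch {
    /** Hypothetical stand-in for the type of a sub-schema in a union. */
    enum Kind { NULL, STRING, INT }

    /** Readable, but sets up a stream pipeline for a very small list. */
    static List<Kind> nonNullViaStream(final List<Kind> unionTypes) {
        return unionTypes.stream()
            .filter(kind -> kind != Kind.NULL)
            .collect(Collectors.toList());
    }

    /** Same result with a plain loop and a pre-sized ArrayList. */
    static List<Kind> nonNullViaLoop(final List<Kind> unionTypes) {
        final List<Kind> nonNull = new ArrayList<>(unionTypes.size());
        for (final Kind kind : unionTypes) {
            if (kind != Kind.NULL) {
                nonNull.add(kind);
            }
        }
        return nonNull;
    }
}
```

Both methods return the same list; the loop version simply avoids the per-call overhead of building a stream, which matters when the method runs for every record.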

Additionally, I've found that Avro's GenericDatumReader is extremely inefficient, at least in some cases, when reading Strings, because it uses an IdentityHashMap to cache details about the schema. IdentityHashMap is far slower than a plain HashMap would be here, so we could subclass the reader in order to avoid the slow caching.
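The subclassing idea can be sketched without Avro on the classpath. In the example below, `CachingStringReader` and `NonCachingStringReader` are hypothetical classes that only imitate the pattern (a base reader that consults an identity-keyed cache on every string read, and a subclass that overrides that one method to skip the cache). They are not Avro's GenericDatumReader API, and the method names are assumptions made for illustration.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical base reader that caches a per-schema decision in an identity map. */
class CachingStringReader {
    private final Map<Object, Class<?>> stringClassCache = new IdentityHashMap<>();
    int cacheLookups = 0; // exposed only so the sketch can show the cache being bypassed

    /** Decides which class should represent strings for the given schema. */
    protected Class<?> findStringClass(final Object schema) {
        return String.class;
    }

    /** Converts the raw text to the chosen string representation. */
    protected Object convert(final Class<?> stringClass, final String raw) {
        return stringClass == String.class ? raw : new StringBuilder(raw);
    }

    /** Looks up the cached decision on every call, which is the hot path to avoid. */
    protected Object readString(final Object schema, final String raw) {
        cacheLookups++;
        final Class<?> stringClass = stringClassCache.computeIfAbsent(schema, this::findStringClass);
        return convert(stringClass, raw);
    }
}

/** Subclass that recomputes the cheap decision directly instead of using the cache. */
class NonCachingStringReader extends CachingStringReader {
    @Override
    protected Object readString(final Object schema, final String raw) {
        return convert(findStringClass(schema), raw);
    }
}
```

This only pays off when recomputing the decision is cheaper than the cache lookup, which is the situation the description reports for string reads.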

Issue Links

links to: GitHub Pull Request #3036

People

Assignee: Mark Payne

Reporter: Mark Payne

Votes: 0

Watchers: 3

Dates

Created: 27/Sep/18 12:52

Updated: 27/Sep/18 19:40

Resolved: 27/Sep/18 19:40