  1. Apache NiFi
  2. NIFI-5640

Improve efficiency of Avro Record Reader



    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.8.0
    • Component/s: None
    • Labels:


      The Avro Reader does a few things that hurt performance. First, in AvroTypeUtil, when converting an Avro GenericRecord to our Record, building the RecordSchema is slow because we call toString() (which is quite expensive) on the Avro schema in order to provide a textual version to RecordSchema. However, the text is typically not used, and providing the schema text is optional, so we should avoid calling Schema#toString() whenever possible.
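One way to avoid the eager toString() call is to defer it until the text is actually requested. The following is a minimal sketch of that idea, not the change made for this issue: `LazySchemaText` is a hypothetical helper, and the `Supplier` stands in for a call such as `avroSchema.toString()`.

```java
import java.util.Objects;
import java.util.Optional;
import java.util.function.Supplier;

/**
 * Hypothetical holder that computes the schema text at most once,
 * and only if a caller actually asks for it.
 */
final class LazySchemaText {
    private final Supplier<String> textSupplier; // e.g. () -> avroSchema.toString() (assumed usage)
    private String cached;                       // stays null until the first request
    private boolean computed;

    LazySchemaText(Supplier<String> textSupplier) {
        this.textSupplier = Objects.requireNonNull(textSupplier);
    }

    /** Runs the expensive supplier on first access, then returns the cached value. */
    synchronized Optional<String> getText() {
        if (!computed) {
            cached = textSupplier.get();
            computed = true;
        }
        return Optional.ofNullable(cached);
    }

    /** True once the supplier has been invoked. */
    synchronized boolean isComputed() {
        return computed;
    }
}
```

With this shape, records that never expose their schema text never pay the serialization cost, while callers that do need the text see the same value on every call.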

      The AvroTypeUtil class also calls #getNonNullSubSchemas() frequently. In some cases we don't need the result and can avoid creating the sublist. In other cases we do need to call it, but the method uses stream() on an existing List just to filter out 0 or 1 elements. While stream() makes the code very readable, it is quite a bit more expensive than iterating over the existing list and adding to an ArrayList. We should avoid stream() for trivial pieces of code in time-critical parts of the codebase.
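The two filtering styles can be compared side by side. This sketch is illustrative only: `NonNullFilter` and its `Kind` enum are hypothetical stand-ins for the real method and for Avro schema types, so that the example runs without the Avro library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class NonNullFilter {
    /** Hypothetical stand-in for the type of an Avro union member. */
    enum Kind { NULL, STRING, INT, LONG }

    /** Readable, but builds a stream pipeline on every call. */
    static List<Kind> nonNullWithStream(List<Kind> unionMembers) {
        return unionMembers.stream()
                .filter(kind -> kind != Kind.NULL)
                .collect(Collectors.toList());
    }

    /** Same result using a plain loop and a pre-sized ArrayList. */
    static List<Kind> nonNullWithLoop(List<Kind> unionMembers) {
        List<Kind> result = new ArrayList<>(unionMembers.size());
        for (Kind kind : unionMembers) {
            if (kind != Kind.NULL) {
                result.add(kind);
            }
        }
        return result;
    }
}
```

Both methods return the same list; the loop version simply skips the stream machinery, which matters when the method is called for every field of every record.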

      Additionally, I've found that Avro's GenericDatumReader is extremely inefficient, at least in some cases, when reading Strings because it uses an IdentityHashMap to cache details about the schema. The IdentityHashMap is far slower than a HashMap would be, so we could subclass the reader in order to avoid the slow caching.




              • Assignee:
                markap14 Mark Payne


                • Created: