[HIVE-17394] AvroSerde is regenerating TypeInfo objects for each nullable Avro field for every row - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.1.0, 3.0.0
Fix Version/s: 3.0.0
Component/s: Serializers/Deserializers
Labels: None
Hadoop Flags: Reviewed

Description

Two methods in AvroDeserializer keep regenerating TypeInfo objects for every nullable field in every row:

private Object deserializeNullableUnion(Object datum, Schema fileSchema, Schema recordSchema) throws AvroSerdeException {
  // elided
  // line 312:
  return worker(datum, fileSchema, newRecordSchema,
      SchemaToTypeInfo.generateTypeInfo(newRecordSchema, null));
}
..
private Object deserializeSingleItemNullableUnion(Object datum, Schema fileSchema, Schema recordSchema) {
  // elided
  // line 357:
  return worker(datum, currentFileSchema, schema,
      SchemaToTypeInfo.generateTypeInfo(schema, null));
}

This is really bad for performance. I'm not sure why we didn't use the TypeInfo we already have instead of generating it again for each nullable field. If you look at the worker method, which calls deserializeNullableUnion, the TypeInfo corresponding to the nullable column has already been determined by that point.
Moreover, the cache in the SchemaToTypeInfo class does not help in the nullable Avro record case, because checking whether an Avro record schema object already exists in the cache requires traversing all the fields in the record schema.
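
To illustrate why such a cache can still be expensive, here is a minimal sketch in plain Java. It does not use Avro or Hive classes: RecordSchema, visitsForLookups and the field counter are hypothetical stand-ins, and the only assumption is that the cache key hashes structurally over all of its fields. Under that assumption, every lookup costs a full pass over the record's fields even when the entry is found.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StructuralKeyCacheSketch {
    // Counts how many field names hashCode() has visited.
    static long fieldVisits = 0;

    // Hypothetical stand-in for a record schema whose hash covers every field.
    static final class RecordSchema {
        final List<String> fieldNames;

        RecordSchema(List<String> fieldNames) {
            this.fieldNames = fieldNames;
        }

        @Override
        public int hashCode() {
            int h = 1;
            for (String name : fieldNames) {
                fieldVisits++;              // one visit per field, on every call
                h = 31 * h + name.hashCode();
            }
            return h;
        }

        @Override
        public boolean equals(Object o) {
            // Not reached in this sketch: HashMap checks reference identity first.
            return o instanceof RecordSchema
                && fieldNames.equals(((RecordSchema) o).fieldNames);
        }
    }

    // Inserts one schema, then looks it up once per row; returns total field visits.
    static long visitsForLookups(int fields, int rows) {
        fieldVisits = 0;
        List<String> names = new ArrayList<>();
        for (int i = 0; i < fields; i++) {
            names.add("f" + i);
        }
        RecordSchema schema = new RecordSchema(names);

        Map<RecordSchema, String> cache = new HashMap<>();
        cache.put(schema, "typeInfo");      // hashes the key once
        for (int r = 0; r < rows; r++) {
            cache.get(schema);              // hashes the key again on every hit
        }
        return fieldVisits;
    }

    public static void main(String[] args) {
        System.out.println("field visits: " + visitsForLookups(50, 1000));
    }
}
```

In this sketch the cost of a cache hit grows with the number of fields in the record, so a per-row, per-field lookup multiplies that cost by the row count.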

I've attached a profiling snapshot, which shows that most of the time is being spent in the cache.

One way of fixing this, IMO, might be to make use of the column TypeInfo that is already passed to the worker method.
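
The suggestion above can be sketched roughly as follows. This is not the attached patch and not Hive's actual code: TypeInfo, generateTypeInfo, worker and the two deserialize methods are simplified, hypothetical stand-ins that only show the difference between regenerating the type per datum and passing down the type the caller already holds.

```java
class ReuseColumnTypeSketch {
    // Counts calls to the (assumed expensive) schema-to-type conversion.
    static int generated = 0;

    // Hypothetical stand-in for a column type descriptor.
    static final class TypeInfo {
        final String name;

        TypeInfo(String name) {
            this.name = name;
        }
    }

    // Stand-in for the conversion that the report says is repeated per row.
    static TypeInfo generateTypeInfo(String schema) {
        generated++;
        return new TypeInfo(schema);
    }

    // Stand-in for the method that deserializes a datum of a known type.
    static Object worker(Object datum, TypeInfo type) {
        return type.name + ":" + datum;
    }

    // Before: the nullable branch derives the type again for each datum.
    static Object deserializeNullableRegenerating(Object datum, String branchSchema) {
        if (datum == null) {
            return null;
        }
        return worker(datum, generateTypeInfo(branchSchema));
    }

    // After: the caller hands over the column type it has already resolved.
    static Object deserializeNullableReusing(Object datum, TypeInfo columnType) {
        if (datum == null) {
            return null;
        }
        return worker(datum, columnType);
    }

    // Deserializes `rows` values and reports how many types were generated.
    static int generationsFor(int rows, boolean reuse) {
        generated = 0;
        TypeInfo columnType = generateTypeInfo("string"); // resolved once, up front
        for (int i = 0; i < rows; i++) {
            if (reuse) {
                deserializeNullableReusing("v" + i, columnType);
            } else {
                deserializeNullableRegenerating("v" + i, "string");
            }
        }
        return generated;
    }

    public static void main(String[] args) {
        System.out.println("regenerating: " + generationsFor(1000, false));
        System.out.println("reusing:      " + generationsFor(1000, true));
    }
}
```

With reuse, the number of conversions stays constant regardless of the row count, whereas the regenerating variant performs one more conversion for every non-null value.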

Attachments

HIVE-17394.1.patch (4 kB), 12/Sep/17 14:59, Anthony Hsu
AvroSerDeUnionTypeInfo.png (88 kB), 28/Aug/17 17:25, Ratandeep Ratti
AvroSerDe.nps (15 kB), 28/Aug/17 17:25, Ratandeep Ratti

Issue Links

relates to: HIVE-18410 [Performance][Avro] Reading flat Avro tables is very expensive in Hive (Closed)

links to: https://reviews.apache.org/r/62247/

People

Assignee: Anthony Hsu
Reporter: Ratandeep Ratti
Votes: 0
Watchers: 3

Dates

Created: 28/Aug/17 17:14
Updated: 22/May/18 23:13
Resolved: 12/Sep/17 22:09