[HIVE-18410] [Performance][Avro] Reading flat Avro tables is very expensive in Hive - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.2.1, 2.1.0, 2.3.2, 3.0.0
Fix Version/s: 3.1.0, 3.0.0
Component/s: None
Labels: None

Description

Reading flat (no nested fields) Avro tables in Hive carries a performance penalty: reading the same flat dataset in Pig takes half the time. Profiling shows that much of the time is spent in AvroDeserializer.deserializeSingleItemNullableUnion(). Most of that time goes to GenericData.get().resolveUnion(), which calls GenericData.getSchemaName(Object datum), which in turn performs many instanceof checks. This path can be simplified for a performance gain. The attached patch describes one approach, which almost halves the runtime.
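
The kind of simplification described above can be illustrated with a minimal, self-contained sketch. It is not the attached patch and does not use the Avro API: the class NullableUnionSketch, the Type enum, and all method names are hypothetical stand-ins. It assumes a union of exactly two branches, one of which is null, and contrasts a per-value instanceof chain with a branch index that is computed once per schema.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the Avro types involved; not the Hive or Avro code.
class NullableUnionSketch {
    enum Type { NULL, INT, LONG, STRING, BOOLEAN, DOUBLE }

    // Per-value resolution: every datum walks a chain of instanceof checks
    // and then searches the union for the matching branch.
    static int resolveByInstanceOf(List<Type> union, Object datum) {
        Type t;
        if (datum == null) t = Type.NULL;
        else if (datum instanceof Integer) t = Type.INT;
        else if (datum instanceof Long) t = Type.LONG;
        else if (datum instanceof CharSequence) t = Type.STRING;
        else if (datum instanceof Boolean) t = Type.BOOLEAN;
        else if (datum instanceof Double) t = Type.DOUBLE;
        else throw new IllegalArgumentException("Unsupported datum: " + datum.getClass());
        int idx = union.indexOf(t);
        if (idx < 0) throw new IllegalArgumentException("Type not in union: " + t);
        return idx;
    }

    // Computed once per schema: the position of the single non-null branch
    // in a two-branch nullable union such as [NULL, STRING].
    static int nonNullBranchIndex(List<Type> union) {
        if (union.size() != 2) {
            throw new IllegalArgumentException("Expected exactly two branches");
        }
        boolean firstNull = union.get(0) == Type.NULL;
        boolean secondNull = union.get(1) == Type.NULL;
        if (firstNull == secondNull) {
            throw new IllegalArgumentException("Expected exactly one NULL branch");
        }
        return firstNull ? 1 : 0;
    }

    // Per-value resolution with the cached index: a single null check.
    static int resolveNullable(int cachedNonNullIndex, Object datum) {
        return datum == null ? 1 - cachedNonNullIndex : cachedNonNullIndex;
    }

    public static void main(String[] args) {
        List<Type> union = Arrays.asList(Type.NULL, Type.STRING);
        int cached = nonNullBranchIndex(union); // done once, not per row
        for (Object datum : new Object[] { "a", null, "b" }) {
            System.out.println(resolveByInstanceOf(union, datum)
                + " == " + resolveNullable(cached, datum));
        }
    }
}
```

Both methods return the same branch index for a nullable union, but the second does a constant amount of work per value regardless of how many types the instanceof chain would otherwise test.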

Attachments

File | Uploaded | Size | Uploaded by
profiling_with_patch.nps | 09/Jan/18 01:42 | 4 kB | Ratandeep Ratti
profiling_without_patch.nps | 09/Jan/18 01:42 | 11 kB | Ratandeep Ratti
profiling_with_patch.png | 09/Jan/18 01:45 | 238 kB | Ratandeep Ratti
profiling_without_patch.png | 09/Jan/18 01:45 | 257 kB | Ratandeep Ratti
HIVE-18410.patch | 09/Jan/18 01:47 | 13 kB | Ratandeep Ratti
HIVE-18410_1.patch | 12/Jan/18 18:22 | 11 kB | Ratandeep Ratti
HIVE-18410_2.patch | 01/Feb/18 19:48 | 13 kB | Ratandeep Ratti
HIVE-18410_3.patch | 03/Feb/18 05:08 | 13 kB | Ratandeep Ratti

Issue Links

is related to: HIVE-17394 AvroSerde is regenerating TypeInfo objects for each nullable Avro field for every row (Closed)

People

Assignee: Ratandeep Ratti

Reporter: Ratandeep Ratti

Votes: 2

Watchers: 7

Dates

Created: 09/Jan/18 01:40

Updated: 22/May/18 23:16

Resolved: 18/Apr/18 02:55