[DRILL-5351] Excessive bounds checking in the Parquet reader - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.11.0
Component/s: None
Labels:
- ready-to-commit

Description

In profiling the Parquet reader, variable-length decoding appears to be a major bottleneck, making the reader CPU-bound rather than disk-bound.
A YourKit profile shows the following methods to be severe bottlenecks:

VarLenBinaryReader.determineSizeSerial(long)
NullableVarBinaryVector$Mutator.setSafe(int, int, int, int, DrillBuf)
DrillBuf.chk(int, int)
NullableVarBinaryVector$Mutator.fillEmpties()

The problem is that each of these methods performs some form of bounds checking, and the eventual write to the ByteBuf is bounds-checked as well.
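
The layering can be illustrated with a minimal, self-contained sketch. It is not Drill code: `java.nio.ByteBuffer` stands in for the underlying ByteBuf, and the class and method names (`LayeredBoundsCheckSketch`, `writeAll`, and the simplified `setSafe` and `chk`) are hypothetical stand-ins for the methods listed above. It only counts how many explicit checks a single variable-length write goes through before the buffer performs its own.

```java
import java.nio.ByteBuffer;

public class LayeredBoundsCheckSketch {
    // Counts explicit checks so the redundancy is visible.
    static int explicitChecks = 0;

    // Hypothetical stand-in for a buffer-level check such as DrillBuf.chk(int, int).
    static void chk(ByteBuffer buf, int index, int length) {
        explicitChecks++;
        if (index < 0 || length < 0 || index > buf.capacity() - length) {
            throw new IndexOutOfBoundsException("index=" + index + ", length=" + length);
        }
    }

    // Hypothetical stand-in for a vector-level "safe" setter.
    static void setSafe(ByteBuffer buf, int offset, byte[] value) {
        // Check 1: the vector verifies that the value fits.
        explicitChecks++;
        if (offset < 0 || offset > buf.capacity() - value.length) {
            throw new IndexOutOfBoundsException("vector too small");
        }
        // Check 2: the buffer wrapper verifies the same range again.
        chk(buf, offset, value.length);
        // Check 3 (implicit): ByteBuffer.put validates every index itself.
        for (int i = 0; i < value.length; i++) {
            buf.put(offset + i, value[i]);
        }
    }

    // Writes the values back to back and returns the number of explicit checks.
    static int writeAll(ByteBuffer buf, byte[][] values) {
        explicitChecks = 0;
        int offset = 0;
        for (byte[] v : values) {
            setSafe(buf, offset, v);
            offset += v.length;
        }
        return explicitChecks;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        byte[][] values = { "ab".getBytes(), "cde".getBytes(), "f".getBytes() };
        System.out.println("explicit checks: " + writeAll(buf, values));
    }
}
```

In this sketch every value pays for two explicit range checks plus the buffer's own per-byte check, which is the kind of repeated work the profile points at.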

DrillBuf.chk can be disabled by a configuration setting. Disabling it improves the performance of TPCH queries, and all regression, unit, and TPCH-SF100 tests still pass.
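
One common way to make such a check switchable is a `static final` flag read once from a JVM system property, so that the JIT can drop the branch entirely when the flag is false. The sketch below shows that pattern only; the class name and the property name `example.unsafe_memory_access` are hypothetical and are not Drill's actual setting.

```java
public class BoundsCheckToggleSketch {
    // Hypothetical property name, used for illustration only.
    static final String PROPERTY = "example.unsafe_memory_access";

    // Checking stays on unless the property is explicitly set to "true".
    static boolean boundsCheckingEnabled(String propertyValue) {
        return !"true".equals(propertyValue);
    }

    // Read once at class-load time; a constant for the rest of the JVM's life.
    static final boolean BOUNDS_CHECKING_ENABLED =
            boundsCheckingEnabled(System.getProperty(PROPERTY));

    // Validates [index, index + length) against capacity when checking is enabled.
    static void chk(int index, int length, int capacity) {
        if (BOUNDS_CHECKING_ENABLED) {
            if (index < 0 || length < 0 || index > capacity - length) {
                throw new IndexOutOfBoundsException(
                        "index=" + index + ", length=" + length + ", capacity=" + capacity);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("bounds checking enabled: " + BOUNDS_CHECKING_ENABLED);
        chk(0, 4, 8); // in range, so it passes either way
    }
}
```

Under these assumptions, starting the JVM with `-Dexample.unsafe_memory_access=true` would skip the check, trading the safety net for speed on trusted workloads.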

I recommend that we allow users to turn this check off for performance-critical queries.

Removing the bounds checking at every level is going to be a fair amount of work. In the meantime, it appears that a few simple changes to the variable-length vectors improve query performance by about 10% across the board.

Issue Links

links to

GitHub Pull Request #781

People

Assignee: Parth Chandra

Reporter: Parth Chandra

Reviewer: Paul Rogers

Votes: 0

Watchers: 4

Dates

Created: 13/Mar/17 17:15

Updated: 03/Apr/17 05:17

Resolved: 03/Apr/17 05:17