[IMPALA-6673] Be smarter about I/O patterns for Parquet scan ranges - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Backend
Labels:
- parquet
- perf

Target Version: Product Backlog

Description

Currently the Parquet scanner is somewhat naive about how it issues column scan ranges: it issues a separate scan range per column, in the order that the column readers are organised internally. If the column ranges are large (i.e. they span multiple I/O buffers) or we're reading from SSDs, where random access is fairly efficient, this may not matter very much. However, this approach is suboptimal when reading smaller columns (e.g. highly compressed ones) from spinning disks, for two reasons:

- Some columns may be adjacent in the file. If we are reading each column into its own small I/O buffer but several adjacent columns would fit in one larger I/O buffer, we would probably be better off doing a single I/O for all of those columns.
- We are reading the columns in a fairly random order, because the I/O manager does round robin on the scan ranges in the order they were added. Sorting the scan ranges by file offset would improve the odds of being able to read each subsequent column without an additional seek, and would also improve locality for the disk's internal cache. Based on some superficial googling, a lot of drives have 64 MB or 128 MB internal caches, which is large enough to be useful but small enough that, if we do I/O from a 256 MB+ Parquet file in a completely random order, we significantly reduce the chances of getting cache hits.
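
The two points above can be illustrated with a minimal Python sketch (not Impala code): sort the column ranges by file offset, then merge neighbouring ranges into one read while the merged read still fits in a single I/O buffer. The names `ScanRange`, `plan_reads`, `max_io_bytes` and `max_gap_bytes` are hypothetical and chosen for illustration only; the 8 MiB buffer limit in the example is likewise an assumption, not a value taken from Impala.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScanRange:
    """A byte range in a file (hypothetical type, not Impala's own class)."""
    offset: int
    length: int

    @property
    def end(self) -> int:
        return self.offset + self.length


def plan_reads(ranges: List[ScanRange], max_io_bytes: int,
               max_gap_bytes: int = 0) -> List[ScanRange]:
    """Sort ranges by file offset and merge neighbours into larger reads.

    Two ranges are merged when the gap between them is at most
    max_gap_bytes and the merged read does not exceed max_io_bytes.
    """
    merged: List[ScanRange] = []
    for r in sorted(ranges, key=lambda x: x.offset):
        if merged:
            last = merged[-1]
            gap = r.offset - last.end
            new_end = max(last.end, r.end)
            if gap <= max_gap_bytes and new_end - last.offset <= max_io_bytes:
                # Extend the previous read to cover this column as well.
                merged[-1] = ScanRange(last.offset, new_end - last.offset)
                continue
        merged.append(r)
    return merged


# Four column ranges, listed in an arbitrary "column reader" order.
columns = [
    ScanRange(3_000_000, 1_000_000),
    ScanRange(0, 1_000_000),
    ScanRange(1_000_000, 2_000_000),
    ScanRange(20_000_000, 500_000),
]
reads = plan_reads(columns, max_io_bytes=8 * 1024 * 1024)
# The three adjacent columns become one 4,000,000-byte read at offset 0;
# the distant column stays a separate read.
print(reads)
```

A non-zero `max_gap_bytes` would trade a few wasted bytes for fewer seeks; whether that is worthwhile depends on the disk and is not something the sketch tries to answer.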

~~IMPALA-4835~~ may help a lot here, since it will tell us upfront what the memory budget is for I/O.

Issue Links

depends upon: IMPALA-4835 HDFS scans should operate with a constrained number of I/O buffers (Resolved)
relates to: IMPALA-5843 Use page index in Parquet files to skip pages (Resolved)

People

Assignee: Unassigned
Reporter: Tim Armstrong
Votes: 2
Watchers: 6

Dates

Created: 15/Mar/18 01:36
Updated: 15/Mar/18 01:38