Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Currently the Parquet scanner is somewhat naive about how it issues column scan ranges: it issues a separate scan range per column, in the order that the column readers are organised internally. If the column ranges are large (i.e. they span multiple I/O buffers) or we're reading from SSDs, where random access is fairly efficient, this may not matter very much. However, this approach is suboptimal when reading smaller columns (e.g. highly compressed ones) from spinning disks, for two reasons:
- Some columns may be adjacent in the file. If we are reading each column into its own smaller I/O buffer but multiple columns would fit in a larger I/O buffer, we would probably be better off doing a single I/O that covers all of those columns.
- We are reading the columns in a fairly random order, because the I/O manager does round robin on the scan ranges in the order they were added. Sorting the scan ranges by file offset would improve the odds of being able to read each subsequent column without an additional seek and would also improve locality for the disk's internal cache. Based on some superficial googling, a lot of drives have 64MB or 128MB internal caches, which is large enough to be useful but small enough that, if we do I/O from a 256MB+ Parquet file in a completely random order, we significantly reduce the chances of getting cache hits.
IMPALA-4835 may help a lot here, since it will tell us upfront what the memory budget is for I/O.
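A rough sketch of the two ideas above (sort by file offset, then coalesce neighbouring column ranges into one larger read) might look like the following. This is illustrative Python, not Impala code: `ScanRange`, `plan_reads`, `max_io_bytes` and `max_gap_bytes` are hypothetical names, and `max_io_bytes` simply stands in for whatever per-read limit the I/O memory budget would allow.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScanRange:
    """A contiguous byte range in a file (hypothetical type, not Impala's)."""
    offset: int
    length: int

    @property
    def end(self) -> int:
        return self.offset + self.length


def plan_reads(ranges: List[ScanRange], max_io_bytes: int,
               max_gap_bytes: int = 0) -> List[ScanRange]:
    """Sort column ranges by file offset and merge neighbours into larger reads.

    A range is merged into the previous read when the gap between them is at
    most max_gap_bytes and the merged read still fits in max_io_bytes.
    """
    planned: List[ScanRange] = []
    for r in sorted(ranges, key=lambda r: r.offset):
        if planned:
            last = planned[-1]
            gap = r.offset - last.end
            merged_end = max(last.end, r.end)
            if gap <= max_gap_bytes and merged_end - last.offset <= max_io_bytes:
                # Extend the previous read instead of issuing a new one.
                planned[-1] = ScanRange(last.offset, merged_end - last.offset)
                continue
        planned.append(r)
    return planned


# Column ranges in the (arbitrary) order the column readers were created.
columns = [
    ScanRange(offset=3_000_000, length=500_000),
    ScanRange(offset=0, length=1_000_000),
    ScanRange(offset=1_000_000, length=2_000_000),
    ScanRange(offset=20_000_000, length=1_000_000),
]
reads = plan_reads(columns, max_io_bytes=8 * 1024 * 1024)
for read in reads:
    print(read)
```

With these made-up offsets, the three adjacent columns at the start of the file collapse into a single read and the distant column stays separate. A non-zero `max_gap_bytes` would trade some wasted bytes for fewer seeks; whether that is worthwhile depends on the disk and is not something this sketch tries to answer.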
Issue Links
- depends upon: IMPALA-4835 HDFS scans should operate with a constrained number of I/O buffers (Resolved)
- relates to: IMPALA-5843 Use page index in Parquet files to skip pages (Resolved)