Details
- Improvement
- Status: Resolved
- Major
- Resolution: Fixed
- Impala 2.7.0
Description
Parquet files are scanned at the granularity of row groups. If a row group spans multiple blocks, some scan ranges will most likely perform remote reads while others perform no reads at all. This contributes to skew, because scan work ends up unevenly distributed across the cluster.
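To make the effect concrete, here is a minimal sketch of how idle scan ranges and remote bytes can arise. It assumes one scan range per block and that each row group is handled by the scan range containing the row group's midpoint. The function names and the sample layout (a 384 MB file with two 192 MB row groups and 128 MB blocks) are hypothetical and are not taken from the Impala code base.

```python
from bisect import bisect_right

MB = 1024 * 1024

def split_into_scan_ranges(file_len, block_size):
    """Return one (start, end) byte range per block (assumption: one range per block)."""
    return [(s, min(s + block_size, file_len)) for s in range(0, file_len, block_size)]

def assign_row_groups(row_groups, scan_ranges):
    """Map each scan range index to the row groups whose midpoint falls inside it.

    row_groups is a list of (offset, length) pairs in bytes.
    """
    starts = [start for start, _ in scan_ranges]
    assignment = {i: [] for i in range(len(scan_ranges))}
    for rg_index, (offset, length) in enumerate(row_groups):
        midpoint = offset + length // 2
        owner = bisect_right(starts, midpoint) - 1
        assignment[owner].append(rg_index)
    return assignment

def bytes_outside_owner(row_group, scan_range):
    """Bytes of a row group that lie outside its owning scan range (potentially remote)."""
    offset, length = row_group
    start, end = scan_range
    overlap = max(0, min(offset + length, end) - max(offset, start))
    return length - overlap

# Hypothetical layout: row groups are larger than the block size.
scan_ranges = split_into_scan_ranges(384 * MB, 128 * MB)
row_groups = [(0, 192 * MB), (192 * MB, 192 * MB)]

assignment = assign_row_groups(row_groups, scan_ranges)
idle_ranges = [i for i, rgs in assignment.items() if not rgs]
remote_bytes = {
    rg: bytes_outside_owner(row_groups[rg], scan_ranges[owner])
    for owner, rgs in assignment.items()
    for rg in rgs
}
```

In this layout the middle scan range owns no row group, while each of the other two has to read 64 MB that lives in the middle block.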
We should consider adding a counter for the number of scan ranges that end up doing no reads. Alternatively, we could just display a warning message saying that the Parquet file is poorly formatted.
In the case of S3, we could suggest that the user changes the default block size (fs.s3a.block.size) to match the row group size of the files to avoid skew.
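The proposed counter and the block size suggestion can be illustrated together. The sketch below is illustrative only: it assumes one scan range per block, assumes a row group is read by the scan range containing its midpoint, and uses a hypothetical helper name rather than any existing Impala counter.

```python
MB = 1024 * 1024

def count_empty_scan_ranges(row_groups, block_size):
    """Count scan ranges that own no row group.

    row_groups is a list of (offset, length) pairs in bytes.
    Assumptions: one scan range per block, midpoint-based ownership.
    """
    file_len = max(offset + length for offset, length in row_groups)
    num_ranges = -(-file_len // block_size)  # ceiling division
    owners = {(offset + length // 2) // block_size for offset, length in row_groups}
    return num_ranges - len(owners)

# Hypothetical file: two 192 MB row groups laid out back to back.
row_groups = [(0, 192 * MB), (192 * MB, 192 * MB)]

empty_with_128mb_blocks = count_empty_scan_ranges(row_groups, 128 * MB)
empty_with_matching_blocks = count_empty_scan_ranges(row_groups, 192 * MB)
```

With 128 MB blocks one of the three scan ranges does no work. When the block size equals the row group size, every scan range owns exactly one row group, which is the situation that setting fs.s3a.block.size to the row group size aims for.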
Issue Links
- is related to IMPALA-3885 Parquet files with multiple blocks cause remote reads (Resolved)