[IMPALA-3989] Display skew warning for poorly formatted Parquet files - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 2.7.0
Fix Version/s: Impala 2.9.0
Component/s: Backend
Labels:
- newbie
- s3
- usability

Target Version: Product Backlog

Description

Parquet files are scanned at the granularity of row groups. If some row groups span multiple blocks, we will most likely end up with some scan ranges performing remote reads and other scan ranges not performing any reads at all. This contributes to skew across the cluster, because the scan work is distributed unevenly.
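The effect described above can be illustrated with a small sketch. It is not Impala's implementation: it assumes one scan range per block, assumes that a row group is processed by the scan range containing the row group's midpoint, and treats every byte outside that block as a remote read. The names `RowGroup`, `assign_row_groups`, and `remote_bytes`, and the file layout, are hypothetical.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class RowGroup:
    offset: int  # byte offset of the row group within the file
    length: int  # size of the row group in bytes


def assign_row_groups(row_groups, file_len, block_size):
    """Map each block-sized scan range index to the row groups it processes.

    Assumption: the scan range containing a row group's midpoint owns it.
    """
    num_ranges = -(-file_len // block_size)  # ceiling division
    assignment = {i: [] for i in range(num_ranges)}
    for rg in row_groups:
        mid = rg.offset + rg.length // 2
        assignment[mid // block_size].append(rg)
    return assignment


def remote_bytes(rg, range_idx, block_size):
    """Bytes of the row group that lie outside the owning scan range's block."""
    start = range_idx * block_size
    end = start + block_size
    local = max(0, min(rg.offset + rg.length, end) - max(rg.offset, start))
    return rg.length - local


# Hypothetical file: two 256 MB row groups stored in 128 MB blocks.
row_groups = [RowGroup(0, 256 * MB), RowGroup(256 * MB, 256 * MB)]
assignment = assign_row_groups(row_groups, 512 * MB, 128 * MB)

# Scan ranges that own no row group and therefore do no reads.
idle = [i for i, rgs in assignment.items() if not rgs]

for idx, rgs in assignment.items():
    remote = sum(remote_bytes(rg, idx, 128 * MB) for rg in rgs)
    print(f"scan range {idx}: {len(rgs)} row group(s), {remote // MB} MB remote")
```

Under these assumptions, half of the scan ranges do no work while the other half each read part of their row group from another block. Rerunning the sketch with a 256 MB block size gives every scan range exactly one row group and no remote bytes.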

We should consider adding a counter for the number of scan ranges that end up doing no reads. Alternatively, we could just display warning messages saying that the Parquet file is poorly formatted.

In the case of S3, we could suggest that the user change the default block size (fs.s3a.block.size) to match the row group size of the files, which avoids the skew.
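As a rough sketch of that suggestion, the helper below derives a block size from a list of row group sizes and formats it as a property setting. It assumes the row groups are roughly uniform in size and written back to back, so a block as large as the largest row group keeps each row group within one block. The function names and the sizes are hypothetical; only the property name fs.s3a.block.size comes from the description.

```python
MB = 1024 * 1024


def suggest_block_size(row_group_sizes):
    """Return a block size in bytes that can hold the largest row group."""
    if not row_group_sizes:
        raise ValueError("need at least one row group size")
    return max(row_group_sizes)


def as_property(block_size):
    """Format the value as a key=value setting for fs.s3a.block.size."""
    return f"fs.s3a.block.size={block_size}"


# Hypothetical row group sizes gathered from the files of one table.
sizes = [256 * MB, 256 * MB, 200 * MB]
print(as_property(suggest_block_size(sizes)))
```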

Issue Links

is related to

IMPALA-3885 Parquet files with multiple blocks cause remote reads

Resolved

People

Assignee: Attila Jeges

Reporter: Sailesh Mukil

Votes: 0

Watchers: 3

Dates

Created: 17/Aug/16 19:12

Updated: 13/Feb/17 18:58

Resolved: 13/Feb/17 18:58