  1. IMPALA
  2. IMPALA-3989

Display skew warning for poorly formatted Parquet files

    Details

      Description

      Parquet files are scanned at the granularity of row groups. If some row groups span multiple blocks, we will most likely end up with some scan ranges performing remote reads and other scan ranges performing no scans at all. This contributes to skew across the cluster, because the scan work is distributed unevenly.
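      The effect can be illustrated with a small sketch. It assumes one scan range per block and that a row group is processed by the scan range containing the row group's midpoint; both are simplifying assumptions for illustration, not a description of the actual scanner code. The names RowGroup, assign_row_groups and blocks_spanned are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroup:
    offset: int  # byte offset of the row group within the file
    length: int  # total byte length of the row group


def assign_row_groups(row_groups, file_len, block_size):
    """Map each block-sized scan range to the row groups it would process.

    Assumption: a row group belongs to the range that contains its midpoint.
    """
    num_ranges = (file_len + block_size - 1) // block_size
    assignment = {i: [] for i in range(num_ranges)}
    for rg in row_groups:
        mid = rg.offset + rg.length // 2
        assignment[mid // block_size].append(rg)
    return assignment


def blocks_spanned(rg, block_size):
    """Number of blocks touched by a row group (more than 1 implies possible remote reads)."""
    first = rg.offset // block_size
    last = (rg.offset + rg.length - 1) // block_size
    return last - first + 1


# Toy example: three 200-byte row groups in a file with 128-byte blocks.
row_groups = [RowGroup(0, 200), RowGroup(200, 200), RowGroup(400, 200)]
assignment = assign_row_groups(row_groups, file_len=600, block_size=128)
idle_ranges = [i for i, rgs in assignment.items() if not rgs]
spans = [blocks_spanned(rg, 128) for rg in row_groups]
```

      In this toy layout every row group crosses at least one block boundary, and two of the five scan ranges are assigned no row group at all.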

      We should consider adding a counter for the number of scan ranges that end up doing no reads. Alternatively, we could just display warning messages saying that the Parquet file is poorly formatted.
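      A minimal sketch of what such a counter and warning could look like, assuming we already know how many row groups each scan range of a file processed. The function name summarize_scan_ranges and the warning wording are hypothetical and not taken from Impala.

```python
def summarize_scan_ranges(row_groups_per_range, filename):
    """Count scan ranges that processed no row groups and build a warning.

    row_groups_per_range: one integer per scan range of the file.
    Returns (idle_count, warning), where warning is None if no range was idle.
    """
    total = len(row_groups_per_range)
    idle = sum(1 for n in row_groups_per_range if n == 0)
    warning = None
    if idle > 0:
        warning = (
            f"Parquet file '{filename}' may be poorly formatted: "
            f"{idle} of {total} scan ranges processed no row groups. "
            "Row groups probably span multiple blocks."
        )
    return idle, warning


idle, warning = summarize_scan_ranges([1, 0, 1, 1, 0], "part-0.parq")
ok_idle, ok_warning = summarize_scan_ranges([1, 1, 1], "part-1.parq")
```

      Returning the count separately from the message keeps both options from the description open: the count can feed a profile counter, and the message can be surfaced as a query warning.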

      In the case of S3, we could suggest that the user change the default block size (fs.s3a.block.size) to match the row group size of the files, so that each block holds a whole row group and skew is avoided.

              People

              • Assignee:
                attilaj Attila Jeges
                Reporter:
                sailesh Sailesh Mukil
              • Votes:
                0
                Watchers:
                3
