IMPALA-3989

Display skew warning for poorly formatted Parquet files


Details

    Description

      Parquet files are scanned at the granularity of row groups. If some row groups span multiple blocks, we will most likely end up with some scan ranges performing remote reads and other scan ranges performing no reads at all. This contributes to skew across the cluster, because scan work is distributed unevenly.

      We should consider adding a counter for the number of scan ranges that end up doing no reads. Alternatively, we could just display a warning message saying that the Parquet file is poorly formatted.
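      To make the proposed counter concrete, here is a minimal Python sketch (not Impala code). It assumes one scan range per block and that each row group is read by the scan range containing the row group's midpoint; that assignment rule, along with the names `RowGroup`, `skew_report`, and `skew_warning`, is an assumption made for illustration and is not stated in this ticket. The file footer is ignored.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class RowGroup:
    offset: int  # byte offset of the row group within the file
    length: int  # size of the row group in bytes


def skew_report(row_groups, file_length, block_size):
    """Count scan ranges with no reads and row groups that span blocks.

    Assumption (hypothetical rule): one scan range per block, and a row
    group belongs to the scan range that contains its midpoint.
    """
    num_ranges = -(-file_length // block_size)  # ceiling division
    assigned = [0] * num_ranges
    spanning = 0
    for rg in row_groups:
        midpoint = rg.offset + rg.length // 2
        assigned[midpoint // block_size] += 1
        first_block = rg.offset // block_size
        last_block = (rg.offset + rg.length - 1) // block_size
        if first_block != last_block:
            spanning += 1
    return {
        "scan_ranges": num_ranges,
        "empty_scan_ranges": sum(1 for n in assigned if n == 0),
        "spanning_row_groups": spanning,
    }


def skew_warning(report):
    """Return a warning string if any scan range does no reads, else None."""
    if report["empty_scan_ranges"] == 0:
        return None
    return (
        f"{report['empty_scan_ranges']} of {report['scan_ranges']} scan ranges "
        f"read no row groups; {report['spanning_row_groups']} row groups span "
        "multiple blocks"
    )


# Three 256 MiB row groups stored in 128 MiB blocks: every row group
# spans two blocks, and half of the scan ranges have nothing to read.
misaligned = [RowGroup(i * 256 * MIB, 256 * MIB) for i in range(3)]
report = skew_report(misaligned, file_length=768 * MIB, block_size=128 * MIB)
print(skew_warning(report))
```

      With row groups that fit exactly into blocks (for example 128 MiB row groups in 128 MiB blocks), the same function reports no empty scan ranges and `skew_warning` returns `None`.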

      In the case of S3, we could suggest that the user change the default block size (fs.s3a.block.size) to match the row group size of the files to avoid skew.
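      As a rough illustration of such a suggestion, the Python sketch below derives a candidate value for fs.s3a.block.size from the row group sizes of a file. The heuristic (largest row group, rounded up to a whole MiB) and the function names are assumptions made for illustration only; it helps most when row groups are of roughly uniform size, and it is not a guarantee that no row group will span a block boundary.

```python
MIB = 1024 * 1024


def suggest_s3a_block_size(row_group_lengths):
    """Suggest a block size in bytes that covers the largest row group.

    Hypothetical heuristic: round the largest row group up to a whole MiB.
    """
    if not row_group_lengths:
        raise ValueError("file has no row groups")
    largest = max(row_group_lengths)
    return -(-largest // MIB) * MIB  # ceiling to the next MiB boundary


def as_property(block_size):
    """Format the suggestion as a key=value pair for the warning text."""
    return f"fs.s3a.block.size={block_size}"


sizes = [256 * MIB, 256 * MIB, 100 * MIB]
print(as_property(suggest_s3a_block_size(sizes)))
```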


            People

              attilaj Attila Jeges
              sailesh Sailesh Mukil
