IMPALA-3989

Display skew warning for poorly formatted Parquet files

    Details

      Description

      Parquet files are scanned at the granularity of row groups. If some row groups span multiple blocks, we will most likely end up with some scan ranges doing remote reads and other scan ranges not scanning anything at all. This contributes to skew across the cluster, because the scan work is distributed unevenly.

      We should consider adding a counter for the number of scan ranges that end up doing no reads. Alternatively, we could just display a warning message saying that the Parquet file is poorly formatted.

      In the case of S3, we could suggest that the user changes the default block size (fs.s3a.block.size) to match the row group size of the files to avoid skew.
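
      To illustrate the skew described above, here is a minimal sketch in Python. It is not Impala's scanner code. It assumes that a row group is scanned by the split containing its midpoint, which this issue does not state, and the names RowGroup, assign_row_groups and skew_report are hypothetical.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class RowGroup:
    offset: int  # byte offset of the row group within the file
    length: int  # total byte length of the row group


def assign_row_groups(row_groups, file_size, split_size):
    """Map each split index to the row groups it would scan.

    Assumption (not taken from the issue): a row group is owned by the
    split that contains the row group's midpoint.
    """
    num_splits = -(-file_size // split_size)  # ceiling division
    owned = {i: [] for i in range(num_splits)}
    for rg in row_groups:
        midpoint = rg.offset + rg.length // 2
        owned[midpoint // split_size].append(rg)
    return owned


def skew_report(row_groups, file_size, split_size):
    """Summarize idle splits and row groups that cross a split boundary."""
    owned = assign_row_groups(row_groups, file_size, split_size)
    idle = [i for i, rgs in owned.items() if not rgs]
    spanning = [
        rg for rg in row_groups
        if rg.offset // split_size != (rg.offset + rg.length - 1) // split_size
    ]
    return {
        "splits": len(owned),
        "idle_splits": idle,
        "spanning_row_groups": len(spanning),
    }


# Two 256 MB row groups in a 512 MB file, scanned with 128 MB splits.
poorly_formatted = [RowGroup(0, 256 * MB), RowGroup(256 * MB, 256 * MB)]
print(skew_report(poorly_formatted, 512 * MB, 128 * MB))

# Same file, but with the split size matched to the row group size.
print(skew_report(poorly_formatted, 512 * MB, 256 * MB))
```

      Under these assumptions, the first call reports splits 0 and 2 as idle and both row groups as spanning several splits. The second call, where the split size equals the row group size, reports no idle splits, which is the effect the fs.s3a.block.size suggestion is aiming for.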

          Activity

          Michael Ho (kwho) added a comment:

          Fixed at https://github.com/apache/incubator-impala/commit/8f59ce9dfc636cc9f6f03ca9f5ee289ca7cca602

          IMPALA-3989: Display skew warning for poorly formatted Parquet files
          Parquet files are scanned at the granularity of row groups. Each row
          group belongs to one or more splits and each split is scanned by a
          separate parquet scanner.

          If some row groups span multiple splits, then we will most likely end
          up seeing some scanners having remote reads and some scanners not
          performing scans at all. This will contribute to skew across the
          cluster where distribution of scans is uneven.

          This change adds a counter (NumScannersWithNoReads) to the scan node's
          runtime profile to track the number of parquet scanners that end up
          doing no reads because of poor formatting.

          Change-Id: Ibf48d978383d73efdade733a892e795ebd53c76a
          Reviewed-on: http://gerrit.cloudera.org:8080/5400
          Reviewed-by: Dan Hecht <dhecht@cloudera.com>
          Tested-by: Impala Public Jenkins
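
          For anyone checking the new counter in a saved text profile, a small sketch in Python follows. Only the counter name NumScannersWithNoReads comes from the commit. The line layout (a counter printed as "NumScannersWithNoReads: 2 (2)", with the raw value in parentheses) is an assumption, and the helper name scanners_with_no_reads is hypothetical.

```python
import re

# Assumed layout of a counter line in a text runtime profile:
#   "- NumScannersWithNoReads: 2 (2)"
# The parenthesized raw value is optional in this pattern.
COUNTER_RE = re.compile(
    r"NumScannersWithNoReads:\s*(?P<shown>[^\s()]+)(?:\s*\((?P<raw>\d+)\))?"
)


def scanners_with_no_reads(profile_text):
    """Return one counter value per occurrence found in the profile text."""
    values = []
    for match in COUNTER_RE.finditer(profile_text):
        raw = match.group("raw")
        # Prefer the raw value in parentheses; otherwise the displayed
        # value must be a plain integer (int() raises ValueError if not).
        values.append(int(raw) if raw is not None else int(match.group("shown")))
    return values


sample_profile = """
HDFS_SCAN_NODE (id=0)
   - NumScannersWithNoReads: 2 (2)
HDFS_SCAN_NODE (id=1)
   - NumScannersWithNoReads: 0 (0)
"""

counts = scanners_with_no_reads(sample_profile)
if any(count > 0 for count in counts):
    print("Some Parquet scanners did no reads; the files may be poorly formatted.")
```

          A non-zero value on any scan node suggests that some row groups span multiple splits.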


            People

            • Assignee:
              attilaj Attila Jeges
              Reporter:
              sailesh Sailesh Mukil