IMPALA-3453

S3: Uneven split sizes are generated for Parquet, causing execution skew

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.6.0
    • Fix Version/s: Impala 2.6.0
    • Component/s: Backend

    Description

      With Impala on S3, unevenly sized splits are assigned to the scan nodes, which introduces execution skew:

        Averaged Fragment F00:(Total: 1m17s, non-child: 0.000ns, % non-child: 0.00%)
            split sizes:  min: 5.01 GB, max: 11.63 GB, avg: 5.91 GB, stddev: 1.08 GB
            completion times: min:5s442ms  max:2m17s  mean: 1m17s  stddev:48s312ms
            execution rates: min:47.64 MB/sec  max:1.06 GB/sec  mean:324.41 MB/sec  stddev:406.41 MB/sec
            num instances: 32
      

      Running the same query against the identical data layout on HDFS does not produce skew.
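
To put a number on the skew reported in the profile above, the `split sizes` line can be reduced to a max-to-average ratio. The following is a minimal sketch, not part of Impala: the helper names `parse_split_sizes` and `skew_ratio` are hypothetical, and the units are assumed to be binary (1 GB = 1024^3 bytes), which does not affect the ratios.

```python
import re

# Unit multipliers for sizes printed in the profile (binary units are an assumption).
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}

def parse_split_sizes(line):
    """Parse a 'split sizes:' profile line into {'min': bytes, 'max': bytes, ...}."""
    stats = {}
    for name, value, unit in re.findall(r"(\w+):\s*([\d.]+)\s*(B|KB|MB|GB)", line):
        stats[name] = float(value) * UNITS[unit]
    return stats

def skew_ratio(stats):
    """Largest split relative to the average split; 1.0 means perfectly even."""
    return stats["max"] / stats["avg"]

# The line below is copied from the fragment profile in the description.
line = "split sizes:  min: 5.01 GB, max: 11.63 GB, avg: 5.91 GB, stddev: 1.08 GB"
stats = parse_split_sizes(line)
print(f"max/avg = {skew_ratio(stats):.2f}")
print(f"max/min = {stats['max'] / stats['min']:.2f}")
```

For the numbers in this report, the largest split is roughly twice the average, so at least one of the 32 instances has about double the data to scan, which is consistent with the wide spread in completion times.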

      Attachments

        1. FullScanQueryParquet.txt
          76 kB
          Mostafa Mokhtar
        2. image.png
          18 kB
          Mostafa Mokhtar
        3. profile_after_IMPALA-3453_128MB.txt
          49 kB
          Sailesh Mukil
        4. profile_after_IMPALA-3453_256MB.txt
          49 kB
          Sailesh Mukil
        5. profile_after_IMPALA-3453.txt
          49 kB
          Sailesh Mukil
        6. profile_before_IMPALA-3453.txt
          49 kB
          Sailesh Mukil
        7. SplitSkewProfile.txt
          174 kB
          Mostafa Mokhtar
        8. TPC-H Q6 profile.txt
          151 kB
          Mostafa Mokhtar

            People

              Assignee: Mostafa Mokhtar (mmokhtar)
              Reporter: Mostafa Mokhtar (mmokhtar)
              Votes: 0
              Watchers: 8
