Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-3240

Regression introduced by IMPALA-2502 in primitive_shuffle_join_union_all_with_groupby

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Won't Fix
    • Impala 2.6.0
    • None
    • Backend

    Description

      The primitive_shuffle_join_union_all_with_groupby is showing a regression after IMPALA-2502, the regression appears to be happening due to serialization serialization of the branches of the UNION.

      Operator             #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                       
      -------------------------------------------------------------------------------------------------------------------------------
      16:EXCHANGE               1   10.113us   10.113us        0      36.00M          0        -1.00 B  UNPARTITIONED                
      15:AGGREGATE             15    5s502ms    6s002ms        0      36.00M  922.04 MB      604.24 MB  FINALIZE                     
      14:EXCHANGE              15  949.540ms    1s056ms  395.39M      36.00M          0              0  HASH(l_orderkey)             
      09:AGGREGATE             15    4s139ms    4s558ms  395.39M      36.00M  145.84 MB      604.24 MB  STREAMING                    
      00:UNION                 15    4s078ms    4s489ms  398.68M      36.00M    8.00 KB              0                               
      |--08:AGGREGATE          15    9s568ms   10s270ms  199.34M      18.00M  794.06 MB      151.06 MB  FINALIZE                     
      |  07:HASH JOIN          15   14s089ms   14s888ms  774.31M      18.00M    1.13 GB      107.00 MB  INNER JOIN, PARTITIONED      
      |  |--13:EXCHANGE        15    1s010ms    1s335ms  204.99M      45.00M          0              0  HASH(o2.o_orderkey)          
      |  |  06:SCAN HDFS       15  402.922ms  458.808ms  204.99M      45.00M  210.24 MB      176.00 MB  tpch_300_parquet.orders o2   
      |  12:EXCHANGE           15    3s101ms    5s305ms  774.31M     180.00M          0              0  HASH(l2.l_orderkey)          
      |  05:SCAN HDFS          15  690.566ms  754.985ms  774.31M     180.00M  263.63 MB      176.00 MB  tpch_300_parquet.lineitem l2 
      04:AGGREGATE             15    9s157ms    9s725ms  199.34M      18.00M  794.06 MB      151.06 MB  FINALIZE                     
      03:HASH JOIN             15   13s297ms   14s457ms  774.31M      18.00M    1.13 GB      107.00 MB  INNER JOIN, PARTITIONED      
      |--11:EXCHANGE           15    1s057ms    1s360ms  204.99M      45.00M          0              0  HASH(o1.o_orderkey)          
      |  02:SCAN HDFS          15  397.264ms  452.629ms  204.99M      45.00M  210.30 MB      176.00 MB  tpch_300_parquet.orders o1   
      10:EXCHANGE              15    3s017ms    5s100ms  774.31M     180.00M          0              0  HASH(l1.l_orderkey)          
      01:SCAN HDFS             15  668.373ms  731.161ms  774.31M     180.00M  263.03 MB      176.00 MB  tpch_300_parquet.lineitem l1 
      
      SELECT Count(*) ROWCOUNT
      FROM (
          SELECT /* +straight_join */ l_orderkey
          FROM lineitem l1
          INNER JOIN /* +shuffle */ orders o1 ON l1.l_orderkey = o1.o_orderkey
          WHERE l_shipdate < '1995-01-01'
              AND o1.o_orderdate < '1995-01-01'
          GROUP BY l_orderkey
          UNION ALL
          SELECT /* +straight_join */ l_orderkey
          FROM lineitem l2
          INNER JOIN /* +shuffle */ orders o2 ON l2.l_orderkey = o2.o_orderkey
          WHERE l_shipdate < '1995-01-01'
              AND o2.o_orderdate < '1995-01-01'
          GROUP BY l_orderkey
          ) a
      GROUP BY l_orderkey
      HAVING ROWCOUNT = 99999999;
      

      Repro
      https://github.com/cloudera/Impala/blob/cdh5-trunk/testdata/workloads/targeted-perf/queries/primitive_shuffle_join_union_all_with_groupby.test

      The profile counters show that one branch of the UNION is taking significantly more time than the other, also for some reason the profile shows significantly higher CPU time for the subquery waiting on open.

      Fragment that gets blocked showing higher CPU

        Averaged Fragment F04:(Total: 43s861ms, non-child: 0.000ns, % non-child: 0.00%)
            split sizes:  min: 1.10 GB, max: 1.48 GB, avg: 1.25 GB, stddev: 107.42 MB
            completion times: min:43s595ms  max:44s660ms  mean: 44s148ms  stddev:333.493ms
            execution rates: min:25.82 MB/sec  max:34.06 MB/sec  mean:29.09 MB/sec  stddev:2.24 MB/sec
            num instances: 15
             - AverageThreadTokens: 6.31 
             - BloomFilterBytes: 0
             - PeakMemoryUsage: 181.76 MB (190585123)
             - PerHostPeakMemUsage: 2.47 GB (2654165942)
             - PrepareTime: 146.519ms
             - RowsProduced: 13.67M (13665799)
             - TotalCpuTime: 3m58s
             - TotalNetworkReceiveTime: 0.000ns
             - TotalNetworkSendTime: 39s952ms
             - TotalStorageWaitTime: 187.611ms
            CodeGen:(Total: 394.245ms, non-child: 394.245ms, % non-child: 100.00%)
               - CodegenTime: 0.000ns
               - CompileTime: 4.357ms
               - LoadTime: 0.000ns
               - ModuleBitcodeSize: 1.49 MB (1559356)
               - OptimizationTime: 243.888ms
               - PrepareTime: 143.875ms
            DataStreamSender (dst_id=13):(Total: 43s583ms, non-child: 43s583ms, % non-child: 100.00%)
               - BytesSent: 252.13 MB (264376136)
               - NetworkThroughput(*): 483.75 KB/sec
               - OverallThroughput: 5.78 MB/sec
               - PeakMemoryUsage: 480.00 KB (491520)
               - SerializeBatchTime: 2s074ms
               - ThriftTransmitTime(*): 8m58s
               - UncompressedRowBatchSize: 599.61 MB (628733578)
            HDFS_SCAN_NODE (id=6):(Total: 402.922ms, non-child: 402.922ms, % non-child: 100.00%)
               - AverageHdfsReadThreadConcurrency: 0.00 
               - AverageScannerThreadConcurrency: 5.36 
               - BytesRead: 158.43 MB (166123086)
               - BytesReadDataNodeCache: 0
               - BytesReadLocal: 158.43 MB (166123086)
               - BytesReadRemoteUnexpected: 0
               - BytesReadShortCircuit: 158.43 MB (166123086)
               - DecompressionTime: 598.367ms
               - MaxCompressedTextFileLength: 0
               - NumColumns: 2 (2)
               - NumDisksAccessed: 4 (4)
               - NumRowGroups: 5 (5)
               - NumScannerThreadsStarted: 5 (5)
               - PeakMemoryUsage: 181.31 MB (190120925)
               - PerReadThreadRawHdfsThroughput: 1.17 GB/sec
               - RemoteScanRanges: 0 (0)
               - RowsRead: 30.00M (30000000)
               - RowsReturned: 13.67M (13665799)
               - RowsReturnedRate: 34.00 M/sec
               - ScanRangesComplete: 5 (5)
               - ScannerThreadsInvoluntaryContextSwitches: 1.72K (1723)
               - ScannerThreadsTotalWallClockTime: 3m54s
                 - MaterializeTupleTime(*): 3m54s
                 - ScannerThreadsSysTime: 128.177ms
                 - ScannerThreadsUserTime: 4s464ms
               - ScannerThreadsVoluntaryContextSwitches: 13.27K (13268)
               - TotalRawHdfsReadTime(*): 141.720ms
               - TotalReadThroughput: 3.60 MB/sec
      

      Fragment that gets admitted right way showing lower CPU

         Averaged Fragment F01:(Total: 3s974ms, non-child: 0.000ns, % non-child: 0.00%)
            split sizes:  min: 1.10 GB, max: 1.48 GB, avg: 1.25 GB, stddev: 107.42 MB
            completion times: min:3s679ms  max:5s077ms  mean: 4s267ms  stddev:372.035ms
            execution rates: min:283.68 MB/sec  max:319.56 MB/sec  mean:301.40 MB/sec  stddev:11.10 MB/sec
            num instances: 15
             - AverageThreadTokens: 5.58 
             - BloomFilterBytes: 0
             - PeakMemoryUsage: 181.75 MB (190583485)
             - PerHostPeakMemUsage: 1.42 GB (1526291688)
             - PrepareTime: 144.209ms
             - RowsProduced: 13.67M (13665799)
             - TotalCpuTime: 23s330ms
             - TotalNetworkReceiveTime: 0.000ns
             - TotalNetworkSendTime: 123.212ms
             - TotalStorageWaitTime: 260.160ms
            CodeGen:(Total: 398.214ms, non-child: 398.214ms, % non-child: 100.00%)
               - CodegenTime: 0.000ns
               - CompileTime: 6.726ms
               - LoadTime: 0.000ns
               - ModuleBitcodeSize: 1.49 MB (1559356)
               - OptimizationTime: 247.982ms
               - PrepareTime: 142.337ms
            DataStreamSender (dst_id=11):(Total: 3s698ms, non-child: 3s698ms, % non-child: 100.00%)
               - BytesSent: 252.22 MB (264470845)
               - NetworkThroughput(*): 34.68 MB/sec
               - OverallThroughput: 68.30 MB/sec
               - PeakMemoryUsage: 480.00 KB (491520)
               - SerializeBatchTime: 2s046ms
               - ThriftTransmitTime(*): 7s274ms
               - UncompressedRowBatchSize: 599.61 MB (628733578)
            HDFS_SCAN_NODE (id=2):(Total: 397.264ms, non-child: 397.264ms, % non-child: 100.00%)
               - AverageHdfsReadThreadConcurrency: 0.04 
               - AverageScannerThreadConcurrency: 5.05 
               - BytesRead: 158.43 MB (166123086)
               - BytesReadDataNodeCache: 0
               - BytesReadLocal: 158.43 MB (166123086)
               - BytesReadRemoteUnexpected: 0
               - BytesReadShortCircuit: 158.43 MB (166123086)
               - DecompressionTime: 579.748ms
               - MaxCompressedTextFileLength: 0
               - NumColumns: 2 (2)
               - NumDisksAccessed: 4 (4)
               - NumRowGroups: 5 (5)
               - NumScannerThreadsStarted: 5 (5)
               - PeakMemoryUsage: 181.31 MB (190121458)
               - PerReadThreadRawHdfsThroughput: 1.13 GB/sec
               - RemoteScanRanges: 0 (0)
               - RowsRead: 30.00M (30000000)
               - RowsReturned: 13.67M (13665799)
               - RowsReturnedRate: 34.49 M/sec
               - ScanRangesComplete: 5 (5)
               - ScannerThreadsInvoluntaryContextSwitches: 1.73K (1732)
               - ScannerThreadsTotalWallClockTime: 19s447ms
                 - MaterializeTupleTime(*): 18s475ms
                 - ScannerThreadsSysTime: 124.778ms
                 - ScannerThreadsUserTime: 4s347ms
               - ScannerThreadsVoluntaryContextSwitches: 13.18K (13184)
               - TotalRawHdfsReadTime(*): 142.595ms
               - TotalReadThroughput: 39.41 MB/sec
      

      Attachments

        Activity

          People

            tarmstrong Tim Armstrong
            mmokhtar Mostafa Mokhtar
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: