IMPALA-4397

Codegen for compute stats query on 1K column table takes 4 minutes

      Description

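      Codegen for the compute stats aggregation takes more than 4 minutes on a table with 1K columns. Most of that time goes to optimization (3m23s) and compilation (56s), as the CodeGen section of the query profile shows: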

              CodeGen:(Total: 4m19s, non-child: 4m19s, % non-child: 100.00%)
                 - CodegenTime: 540.513ms
                 - CompileTime: 56s810ms
                 - LoadTime: 0.000ns
                 - ModuleBitcodeSize: 1.90 MB (1993816)
                 - NumFunctions: 14.03K (14033)
                 - NumInstructions: 265.60K (265603)
                 - OptimizationTime: 3m23s
                 - PrepareTime: 41.053ms
      
      Attachments

      1. CreateWideTable.sql (350 kB, Mostafa Mokhtar)
      2. Vtune-Bottom-Up.csv (556 kB, Mostafa Mokhtar)
      3. VtuneTopDownTree.csv (149 kB, Mostafa Mokhtar)

          Activity

          tarmstrong Tim Armstrong added a comment -

          IMPALA-4397,IMPALA-3259: reduce codegen time and memory

          A handful of fixes to codegen memory usage:

          • Delete the IR module when we're done with it (it can be fairly large)
          • Track the compiled code size (typically not that large, but it can add
            up if there are many fragments).
          • Estimate optimisation memory requirements and track it in the memory
            tracker. This is very crude but much better than not tracking it.
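
The three memory fixes above amount to making each phase of codegen visible to a memory tracker. The sketch below only illustrates that bookkeeping in Python. `ToyMemTracker`, `run_codegen`, and the `opt_overhead_factor` multiplier are hypothetical names introduced here; they do not correspond to Impala's classes or to the estimate the patch actually uses, and the byte sizes are illustrative inputs, not measurements.

```python
class ToyMemTracker:
    """Minimal stand-in for a memory tracker: counts bytes and records the peak."""

    def __init__(self):
        self.current = 0
        self.peak = 0

    def consume(self, num_bytes):
        self.current += num_bytes
        self.peak = max(self.peak, self.current)

    def release(self, num_bytes):
        if num_bytes > self.current:
            raise ValueError("releasing more memory than was consumed")
        self.current -= num_bytes


def run_codegen(tracker, ir_module_bytes, compiled_code_bytes, opt_overhead_factor=4):
    """Walk through the codegen phases, accounting for memory at each step.

    opt_overhead_factor is an assumed, deliberately crude multiplier; the
    real estimate used by the patch is not described in this issue.
    """
    # The IR module is live while it is being built and optimised.
    tracker.consume(ir_module_bytes)

    # Reserve a rough estimate for the optimiser's working memory up front,
    # then give it back once optimisation has finished.
    opt_estimate = ir_module_bytes * opt_overhead_factor
    tracker.consume(opt_estimate)
    tracker.release(opt_estimate)

    # The compiled code stays resident for the lifetime of the fragment.
    tracker.consume(compiled_code_bytes)

    # The IR module is no longer needed after compilation, so free it.
    tracker.release(ir_module_bytes)


# Illustrative sizes only (not taken from a real profile).
tracker = ToyMemTracker()
run_codegen(tracker, ir_module_bytes=2_000_000, compiled_code_bytes=300_000)
print(f"peak during codegen: {tracker.peak} bytes")
print(f"resident afterwards: {tracker.current} bytes")
```

In this toy model the peak is dominated by the optimisation estimate, while only the compiled code remains charged once the IR module has been released.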

          A handful of fixes to improve codegen time/cost, particularly targeted
          at compute stats workloads:

          • Avoid over-inlining when there are many aggregate functions,
            conjuncts, etc., by adding "NoInline" attributes.
          • Don't codegen non-grouping merge aggregations. They will only process
            one row per Impala daemon, so codegen is not worth it.
          • Make the HLL algorithm more efficient by specialising the hash function
            based on decimal width.
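
As a rough illustration of the last point, specialising a hash function on decimal width means resolving the width once per column instead of handling it for every value. The Python sketch below mimics that shape with a closure. It makes several assumptions that are not stated in this issue: the decimal byte widths of 4, 8, and 16, the use of 64-bit FNV-1a as a stand-in hash, and all function names. The real change lives in Impala's generated code and does not look like this.

```python
FNV_OFFSET = 0xCBF29CE484222325  # 64-bit FNV-1a offset basis
FNV_PRIME = 0x100000001B3        # 64-bit FNV prime
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a(data):
    """64-bit FNV-1a over a bytes object (stand-in hash, not Impala's)."""
    h = FNV_OFFSET
    for byte in data:
        h = ((h ^ byte) * FNV_PRIME) & MASK64
    return h


def generic_decimal_hash(unscaled, byte_width):
    """Generic path: the width is a runtime argument, validated for every value."""
    if byte_width not in (4, 8, 16):
        raise ValueError(f"unsupported decimal width: {byte_width}")
    return fnv1a(unscaled.to_bytes(byte_width, "little", signed=True))


def make_decimal_hash(byte_width):
    """Specialised path: validate the width once and return a hasher with it baked in."""
    if byte_width not in (4, 8, 16):
        raise ValueError(f"unsupported decimal width: {byte_width}")

    def hash_value(unscaled):
        return fnv1a(unscaled.to_bytes(byte_width, "little", signed=True))

    return hash_value


# One specialised hasher per column, chosen once up front.
hash_decimal8 = make_decimal_hash(8)
values = [0, 1, -1, 123456789, -987654321]
hashes = [hash_decimal8(v) for v in values]

# The specialised hasher must agree with the generic one for the same width.
print(all(h == generic_decimal_hash(v, 8) for h, v in zip(hashes, values)))
```

The specialised and generic paths produce identical hashes; the only difference is where the width decision is made.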

          Limitations:

          • This doesn't tackle over-inlining of large expr trees, but a similar
            approach will be used there in a follow-on patch.

          Perf:
          Compute stats on functional_parquet.widetable_1000_cols goes from more
          than 1 minute of codegen to about 5s of codegen on my machine. Local
          perf runs of TPC-H and targeted perf showed no regressions and some
          moderate improvements (1-2%).

          I also ran an experiment to understand the perf consequences of disabling
          inlining. I manually set CODEGEN_INLINE_EXPRS_THRESHOLD to 0 and ran:

          drop stats tpch_20_parquet.lineitem;
          compute stats tpch_20_parquet.lineitem;

          There was no difference in time spent in the agg node: 30.7s with
          inlining, 30.5s without.

          Change-Id: Id10015b49da182cb181a653ac8464b4a18b71091
          Reviewed-on: http://gerrit.cloudera.org:8080/4956
          Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
          Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
          Tested-by: Internal Jenkins

          tarmstrong Tim Armstrong added a comment -

          IMPALA-4397 addendum: remove stray semicolon

          Change-Id: Ie16e403ae0eb657f0614a1a90b0556f9f1d1056e
          Reviewed-on: http://gerrit.cloudera.org:8080/5261
          Tested-by: Impala Public Jenkins
          Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>

          M be/src/codegen/mcjit-mem-mgr.h
          1 file changed, 1 insertion, 1 deletion

          mmokhtar Mostafa Mokhtar added a comment -

          After the fix:

                CodeGen:(Total: 15s095ms, non-child: 15s095ms, % non-child: 100.00%)
                   - CodegenTime: 841.839ms
                   - CompileTime: 5s497ms
                   - LoadTime: 0.000ns
                   - ModuleBitcodeSize: 1.91 MB (1997564)
                   - NumFunctions: 13.04K (13037)
                   - NumInstructions: 289.74K (289743)
                   - OptimizationTime: 9s558ms
                   - PeakMemoryUsage: 141.48 MB (148348416)
                   - PrepareTime: 36.027ms
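
To put the two CodeGen profiles in this issue side by side, the counters can be converted to seconds and compared. The following is a minimal sketch: `parse_duration` is a hypothetical helper written for this comparison, not part of Impala, and it assumes the duration format seen in the profiles (concatenated number and unit pairs such as `4m19s` or `540.513ms`).

```python
import re

# Seconds per unit suffix as printed in the profile counters.
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}
# Two-letter suffixes come first so "ms" is not read as "m" followed by "s".
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ms|us|ns|h|m|s)")


def parse_duration(text):
    """Convert a profile duration such as '4m19s' or '540.513ms' to seconds."""
    tokens = _TOKEN.findall(text)
    # Reject input that the tokens do not cover completely.
    if not tokens or "".join(n + u for n, u in tokens) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(n) * _UNIT_SECONDS[u] for n, u in tokens)


# Counter values copied from the two CodeGen profiles in this issue.
before = {"Total": "4m19s", "CompileTime": "56s810ms", "OptimizationTime": "3m23s"}
after = {"Total": "15s095ms", "CompileTime": "5s497ms", "OptimizationTime": "9s558ms"}

speedups = {
    name: parse_duration(before[name]) / parse_duration(after[name]) for name in before
}

for name, factor in speedups.items():
    print(f"{name}: {parse_duration(before[name]):.3f}s -> "
          f"{parse_duration(after[name]):.3f}s ({factor:.1f}x faster)")
```

On these numbers, total codegen time drops by roughly 17x, compile time by roughly 10x, and optimization time by roughly 21x.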
          

            People

            • Assignee: Tim Armstrong (tarmstrong)
            • Reporter: Mostafa Mokhtar (mmokhtar)