Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4491

Avoid reading columns that are only part of constant expressions in select

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • Impala 2.5.0
    • None
    • Backend
    • None

    Description

      Query execution should skip reading any columns if their value is not actually needed (e.g. with an IF expression where test condition is constant).

      I think this is a very good general optimization but in our specific case, we have an in-house reporting engine that deals with optional report columns by prepending each in the query with IF([show_in_ui_var]=1, somecolumn, 0) which becomes IF(0=1, somecolumn, 0) when the column is not selected in the UI. A report can easily have 40 of 50 total columns 'turned off' this way. But currently Impala actually does hdfs scans for all 50. I'm attaching one such full sample report query.

      Here's a test to reproduce:

      create table ints_to_ten (x int);
      
      insert into ints_to_ten values (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
      
      create table sample_wide_table (
        col1 int,
        col2 int,
        col3 int,
        col4 int,
        col5 int,
        col6 int,
        col7 int,
        col8 int,
        col9 int,
        col10 int,
        col11 int,
        col12 int,
        col13 int,
        col14 int,
        col15 int,
        col16 int,
        col17 int,
        col18 int,
        col19 int,
        col20 int,
        col21 int,
        col22 int,
        col23 int,
        col24 int,
        col25 int,
        col26 int,
        col27 int,
        col28 int,
        col29 int,
        col30 int,
        col31 int,
        col32 int,
        col33 int,
        col34 int,
        col35 int,
        col36 int,
        col37 int,
        col38 int,
        col39 int,
        col40 int,
        col41 int,
        col42 int,
        col43 int,
        col44 int,
        col45 int,
        col46 int,
        col47 int,
        col48 int,
        col49 int,
        col50 int
      )
      stored as parquet;
      
      insert overwrite sample_wide_table
      select 
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val,
      val
      from (
      select 
      cast(round(rand() * 10000) as int) val
      from ints_to_ten t1, ints_to_ten t2, ints_to_ten t3, ints_to_ten t4, ints_to_ten t5, ints_to_ten t6, ints_to_ten t7, ints_to_ten t8
      ) x;
      
      -- confirm 100 million rows, all columns have same aggregate values because each rows uses same pseudo-random value for all columns
      select count(*), min(col1), max(col1), min(col25), max(col25), min(col50), max(col50) from sample_wide_table;
      
      
      
      -- now run following 3 queries repeatedly.  Queries 1 and 3 take about the same time (9 seconds in our prod cluster), while query 2 with just 1 column takes about 1 second (over about 15 runs I tried).
      
      select
      avg(col1),
      avg(col2),
      avg(col3),
      avg(col4),
      avg(col5),
      avg(col6),
      avg(col7),
      avg(col8),
      avg(col9),
      avg(col10),
      avg(col11),
      avg(col12),
      avg(col13),
      avg(col14),
      avg(col15),
      avg(col16),
      avg(col17),
      avg(col18),
      avg(col19),
      avg(col20),
      avg(col21),
      avg(col22),
      avg(col23),
      avg(col24),
      avg(col25),
      avg(col26),
      avg(col27),
      avg(col28),
      avg(col29),
      avg(col30),
      avg(col31),
      avg(col32),
      avg(col33),
      avg(col34),
      avg(col35),
      avg(col36),
      avg(col37),
      avg(col38),
      avg(col39),
      avg(col40),
      avg(col41),
      avg(col42),
      avg(col43),
      avg(col44),
      avg(col45),
      avg(col46),
      avg(col47),
      avg(col48),
      avg(col49),
      avg(col50)
      from sample_wide_table;
      
      select
      avg(col1)
      from sample_wide_table;
      
      
      select
      IF(1=0, avg(col1), 0),
      IF(1=0, avg(col2), 0),
      IF(1=0, avg(col3), 0),
      IF(1=0, avg(col4), 0),
      IF(1=0, avg(col5), 0),
      IF(1=0, avg(col6), 0),
      IF(1=0, avg(col7), 0),
      IF(1=0, avg(col8), 0),
      IF(1=0, avg(col9), 0),
      IF(1=0, avg(col10), 0),
      IF(1=0, avg(col11), 0),
      IF(1=0, avg(col12), 0),
      IF(1=0, avg(col13), 0),
      IF(1=0, avg(col14), 0),
      IF(1=0, avg(col15), 0),
      IF(1=0, avg(col16), 0),
      IF(1=0, avg(col17), 0),
      IF(1=0, avg(col18), 0),
      IF(1=0, avg(col19), 0),
      IF(1=0, avg(col20), 0),
      IF(1=0, avg(col21), 0),
      IF(1=0, avg(col22), 0),
      IF(1=0, avg(col23), 0),
      IF(1=0, avg(col24), 0),
      IF(1=0, avg(col25), 0),
      IF(1=0, avg(col26), 0),
      IF(1=0, avg(col27), 0),
      IF(1=0, avg(col28), 0),
      IF(1=0, avg(col29), 0),
      IF(1=0, avg(col30), 0),
      IF(1=0, avg(col31), 0),
      IF(1=0, avg(col32), 0),
      IF(1=0, avg(col33), 0),
      IF(1=0, avg(col34), 0),
      IF(1=0, avg(col35), 0),
      IF(1=0, avg(col36), 0),
      IF(1=0, avg(col37), 0),
      IF(1=0, avg(col38), 0),
      IF(1=0, avg(col39), 0),
      IF(1=0, avg(col40), 0),
      IF(1=0, avg(col41), 0),
      IF(1=0, avg(col42), 0),
      IF(1=0, avg(col43), 0),
      IF(1=0, avg(col44), 0),
      IF(1=0, avg(col45), 0),
      IF(1=0, avg(col46), 0),
      IF(1=0, avg(col47), 0),
      IF(1=0, avg(col48), 0),
      IF(1=0, avg(col49), 0),
      IF(1=0, avg(col50), 0)
      from sample_wide_table;
      

      Queries averaging all 50 columns for all 100 million records take the same time (about 9 seconds in our cluster) whether actually returning the aggregate or returning a constant value. As a control, queries averaging 1 column take about 1 second.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              maristir_impala_cf3d Mauricio Aristizabal
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: