IMPALA-10932

Make sure all kinds of simple predicates on bool columns are pushed down

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      When scanning parquet/orc tables, we push down binary predicates like "x < 1" to leverage the file-level statistics. However, predicates on bool columns may not be in this form. They could be "x", "NOT x", "x IS [NOT] TRUE", or "x IS [NOT] FALSE".
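
      To make the equivalences concrete, here is a minimal sketch (plain Python, not Impala planner code) that maps each form to the column values a WHERE clause would keep, and then spells out a form built only from "x = <literal>" and "x IS NULL". The names ACCEPTED_VALUES and equivalent_form are hypothetical, and the mapping assumes standard SQL three-valued logic, where a NULL result filters the row out.

```python
# Illustrative only: this is not Impala planner code.
# Each predicate form on a BOOLEAN column x is mapped to the set of values
# (True, False, or None for NULL) for which a WHERE clause keeps the row.
ACCEPTED_VALUES = {
    "x": {True},
    "NOT x": {False},
    "x IS TRUE": {True},
    "x IS NOT TRUE": {False, None},
    "x IS FALSE": {False},
    "x IS NOT FALSE": {True, None},
    "x < TRUE": {False},  # the binary form that is already pushed down
}


def equivalent_form(form):
    """Rewrite a predicate form using only 'x = <literal>' and 'x IS NULL'."""
    accepted = ACCEPTED_VALUES[form]
    parts = []
    if True in accepted:
        parts.append("x = TRUE")
    if False in accepted:
        parts.append("x = FALSE")
    if None in accepted:
        parts.append("x IS NULL")
    return " OR ".join(parts)


for form in ACCEPTED_VALUES:
    print(f"{form:<15} ->  {equivalent_form(form)}")
```

      The "IS NOT" forms are the ones that also accept NULL, so they cannot be expressed as a single binary predicate.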

      Note that dictionary predicates already cover some of these forms, but not all of them. For instance, here the predicate appears in the dictionary predicates:

      set explain_level=2;
      explain select count(*) from functional_parquet.alltypessmall where bool_col;
      | 00:SCAN HDFS [functional_parquet.alltypessmall, RANDOM]        |
      |    HDFS partitions=4/4 files=4 size=14.76KB                    |
      |    predicates: bool_col                                        |
      |    stored statistics:                                          |
      |      table: rows=unavailable size=unavailable                  |
      |      partitions: 0/4 rows=939                                  |
      |      columns: unavailable                                      |
      |    extrapolated-rows=disabled max-scan-range-rows=unavailable  |
      |    parquet dictionary predicates: bool_col                     |
      

      Here we still have the predicate in dictionary predicates:

      explain select count(*) from functional_parquet.alltypessmall where bool_col is true;
      | 00:SCAN HDFS [functional_parquet.alltypessmall, RANDOM]        |
      |    HDFS partitions=4/4 files=4 size=14.76KB                    |
      |    predicates: istrue(bool_col)                                |
      |    stored statistics:                                          |
      |      table: rows=unavailable size=unavailable                  |
      |      partitions: 0/4 rows=939                                  |
      |      columns: unavailable                                      |
      |    extrapolated-rows=disabled max-scan-range-rows=unavailable  |
      |    parquet dictionary predicates: istrue(bool_col)             |
      

      But here we don't have any predicates pushed down to stats or dictionary:

      explain select count(*) from functional_parquet.alltypessmall where bool_col is not true;
      | 00:SCAN HDFS [functional_parquet.alltypessmall, RANDOM]             |
      |    HDFS partitions=4/4 files=4 size=14.76KB                         |
      |    predicates: isnottrue(bool_col)                                  |
      |    stored statistics:                                               |
      |      table: rows=unavailable size=unavailable                       |
      |      partitions: 0/4 rows=939                                       |
      |      columns: unavailable                                           |
      |    extrapolated-rows=disabled max-scan-range-rows=unavailable       |
      |    mem-estimate=16.00MB mem-reservation=8.00KB thread-reservation=1 |
      |    tuple-ids=0 row-size=1B cardinality=94                           |
      |    in pipelines: 00(GETNEXT)                                        |
      

      If we use the unusual form "x < TRUE", the predicate appears in both the statistics predicates and the dictionary predicates:

      explain select count(*) from functional_parquet.alltypessmall where bool_col < true;
      | 00:SCAN HDFS [functional_parquet.alltypessmall]                     |
      |    HDFS partitions=4/4 files=4 size=14.76KB                         |
      |    predicates: bool_col < TRUE                                      |
      |    stored statistics:                                               |
      |      table: rows=unavailable size=unavailable                       |
      |      partitions: 0/4 rows=939                                       |
      |      columns: unavailable                                           |
      |    extrapolated-rows=disabled max-scan-range-rows=unavailable       |
      |    parquet statistics predicates: bool_col < TRUE                   |
      |    parquet dictionary predicates: bool_col < TRUE                   |
      |    mem-estimate=16.00MB mem-reservation=8.00KB thread-reservation=1 |
      

      This form is rarely used for bool columns, so we should handle the forms mentioned above as well.
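
      As a rough illustration of what the pushdown would buy us, the sketch below decides whether a row group can be skipped from its min/max statistics. It is plain Python, not Impala's scanner code: the function names are hypothetical, and it assumes min, max and null_count are all available for the bool column (None for min/max meaning the row group holds only NULLs).

```python
# Illustrative only: names and structure are hypothetical, not Impala code.
# A predicate is represented by the set of column values it accepts:
# True, False, and/or None (standing for NULL).

def values_possibly_present(min_val, max_val, null_count):
    """Values a row group may contain, judged from its column statistics."""
    present = set()
    if null_count > 0:
        present.add(None)
    if min_val is not None and max_val is not None:
        if min_val is False:   # min == FALSE: at least one FALSE value
            present.add(False)
        if max_val is True:    # max == TRUE: at least one TRUE value
            present.add(True)
    return present


def can_skip_row_group(accepted, min_val, max_val, null_count):
    """True if no value accepted by the predicate can occur in the row group."""
    return not (accepted & values_possibly_present(min_val, max_val, null_count))


# "x" / "x IS TRUE" accept {True}; "x IS NOT TRUE" accepts {False, None}.
print(can_skip_row_group({True}, False, False, 0))
print(can_skip_row_group({False, None}, True, True, 0))
print(can_skip_row_group({False, None}, True, True, 3))
```

      Under these assumptions, the "IS NOT TRUE" and "IS NOT FALSE" forms need the null count in addition to min/max, since they also accept NULL.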

            People

              stigahuang Quanlong Huang