[IMPALA-10932] Make sure all kinds of simple predicates on bool columns are pushed down

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

When scanning Parquet/ORC tables, we push down binary predicates like "x < 1" to leverage the file-level statistics. However, predicates on bool columns may not be in this form. They could be "x", "NOT x", "x IS [NOT] TRUE", or "x IS [NOT] FALSE".

Note that the dictionary predicates may cover some of these forms, but not all of them. For instance, here the predicate appears in the dictionary predicates:

set explain_level=2;
explain select count(*) from functional_parquet.alltypessmall where bool_col;
| 00:SCAN HDFS [functional_parquet.alltypessmall, RANDOM]        |
|    HDFS partitions=4/4 files=4 size=14.76KB                    |
|    predicates: bool_col                                        |
|    stored statistics:                                          |
|      table: rows=unavailable size=unavailable                  |
|      partitions: 0/4 rows=939                                  |
|      columns: unavailable                                      |
|    extrapolated-rows=disabled max-scan-range-rows=unavailable  |
|    parquet dictionary predicates: bool_col                     |

Here we still have the predicate in dictionary predicates:

explain select count(*) from functional_parquet.alltypessmall where bool_col is true;
| 00:SCAN HDFS [functional_parquet.alltypessmall, RANDOM]        |
|    HDFS partitions=4/4 files=4 size=14.76KB                    |
|    predicates: istrue(bool_col)                                |
|    stored statistics:                                          |
|      table: rows=unavailable size=unavailable                  |
|      partitions: 0/4 rows=939                                  |
|      columns: unavailable                                      |
|    extrapolated-rows=disabled max-scan-range-rows=unavailable  |
|    parquet dictionary predicates: istrue(bool_col)             |

But here we don't have any predicates pushed down to stats or dictionary:

explain select count(*) from functional_parquet.alltypessmall where bool_col is not true;
| 00:SCAN HDFS [functional_parquet.alltypessmall, RANDOM]             |
|    HDFS partitions=4/4 files=4 size=14.76KB                         |
|    predicates: isnottrue(bool_col)                                  |
|    stored statistics:                                               |
|      table: rows=unavailable size=unavailable                       |
|      partitions: 0/4 rows=939                                       |
|      columns: unavailable                                           |
|    extrapolated-rows=disabled max-scan-range-rows=unavailable       |
|    mem-estimate=16.00MB mem-reservation=8.00KB thread-reservation=1 |
|    tuple-ids=0 row-size=1B cardinality=94                           |
|    in pipelines: 00(GETNEXT)                                        |

If we use the unusual form "x < TRUE", the predicate appears in both the statistics predicates and the dictionary predicates:

explain select count(*) from functional_parquet.alltypessmall where bool_col < true;
| 00:SCAN HDFS [functional_parquet.alltypessmall]                     |
|    HDFS partitions=4/4 files=4 size=14.76KB                         |
|    predicates: bool_col < TRUE                                      |
|    stored statistics:                                               |
|      table: rows=unavailable size=unavailable                       |
|      partitions: 0/4 rows=939                                       |
|      columns: unavailable                                           |
|    extrapolated-rows=disabled max-scan-range-rows=unavailable       |
|    parquet statistics predicates: bool_col < TRUE                   |
|    parquet dictionary predicates: bool_col < TRUE                   |
|    mem-estimate=16.00MB mem-reservation=8.00KB thread-reservation=1 |

Usually, we don't use this form for bool columns, so we should handle the above-mentioned forms as well.
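
To illustrate why all of these forms are candidates for stats pushdown, here is a minimal Python sketch of the idea. It is not Impala code: the `ACCEPTED` table and the `can_skip` helper are hypothetical names, and the model assumes a row group exposes min, max, and null-count statistics for the bool column. Each predicate form is mapped to the set of column values (TRUE, FALSE, or NULL) that satisfy it in a WHERE clause, and a row group can be skipped when the stats prove none of those values can occur.

```python
from typing import Optional

# Hypothetical mapping (an assumption for illustration, not Impala's API):
# each predicate form -> column values that pass a WHERE clause.
# None stands for SQL NULL.
ACCEPTED = {
    "x": frozenset({True}),
    "NOT x": frozenset({False}),
    "x IS TRUE": frozenset({True}),
    "x IS NOT TRUE": frozenset({False, None}),
    "x IS FALSE": frozenset({False}),
    "x IS NOT FALSE": frozenset({True, None}),
}


def can_skip(form: str,
             min_val: Optional[bool],
             max_val: Optional[bool],
             null_count: Optional[int]) -> bool:
    """Return True only if the stats prove that no row satisfies `form`."""
    possible = set()
    if min_val is None or max_val is None:
        # Min/max unavailable: conservatively assume both values may occur.
        possible.update({True, False})
    else:
        # For a bool column the range [min, max] is exactly {min, max}.
        possible.update({min_val, max_val})
    if null_count is None or null_count > 0:
        # Unknown or positive null count: NULLs may be present.
        possible.add(None)
    return possible.isdisjoint(ACCEPTED[form])
```

Under this model, "x IS NOT TRUE" needs the null count as well as min/max, because NULL rows also satisfy it. That may be one reason this form is harder to push down than a plain binary predicate.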

Issue Links

relates to: IMPALA-9040 Read performance improvements in ORC support (Open)