Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
Description
When scanning Parquet/ORC tables, we push down binary predicates like "x < 1" to leverage the file-level statistics. However, predicates on bool columns may not be in this form. They could be "x", "NOT x", "x IS [NOT] TRUE", or "x IS [NOT] FALSE".
Note that some of these forms are picked up as dictionary predicates, but not all of them. For instance, here the bare column predicate appears in the dictionary predicates:
set explain_level=2;
explain select count(*) from functional_parquet.alltypessmall where bool_col;
| 00:SCAN HDFS [functional_parquet.alltypessmall, RANDOM] |
|    HDFS partitions=4/4 files=4 size=14.76KB |
|    predicates: bool_col |
|    stored statistics: |
|      table: rows=unavailable size=unavailable |
|      partitions: 0/4 rows=939 |
|      columns: unavailable |
|    extrapolated-rows=disabled max-scan-range-rows=unavailable |
|    parquet dictionary predicates: bool_col |
The "IS TRUE" form also appears in the dictionary predicates:
explain select count(*) from functional_parquet.alltypessmall where bool_col is true;
| 00:SCAN HDFS [functional_parquet.alltypessmall, RANDOM] |
|    HDFS partitions=4/4 files=4 size=14.76KB |
|    predicates: istrue(bool_col) |
|    stored statistics: |
|      table: rows=unavailable size=unavailable |
|      partitions: 0/4 rows=939 |
|      columns: unavailable |
|    extrapolated-rows=disabled max-scan-range-rows=unavailable |
|    parquet dictionary predicates: istrue(bool_col) |
But with "IS NOT TRUE", no predicate is pushed down to either the statistics or the dictionary filters:
explain select count(*) from functional_parquet.alltypessmall where bool_col is not true;
| 00:SCAN HDFS [functional_parquet.alltypessmall, RANDOM] |
|    HDFS partitions=4/4 files=4 size=14.76KB |
|    predicates: isnottrue(bool_col) |
|    stored statistics: |
|      table: rows=unavailable size=unavailable |
|      partitions: 0/4 rows=939 |
|      columns: unavailable |
|    extrapolated-rows=disabled max-scan-range-rows=unavailable |
|    mem-estimate=16.00MB mem-reservation=8.00KB thread-reservation=1 |
|    tuple-ids=0 row-size=1B cardinality=94 |
|    in pipelines: 00(GETNEXT) |
If we use the unusual form "x < TRUE", the predicate shows up in both the statistics and the dictionary predicates:
explain select count(*) from functional_parquet.alltypessmall where bool_col < true;
| 00:SCAN HDFS [functional_parquet.alltypessmall] |
|    HDFS partitions=4/4 files=4 size=14.76KB |
|    predicates: bool_col < TRUE |
|    stored statistics: |
|      table: rows=unavailable size=unavailable |
|      partitions: 0/4 rows=939 |
|      columns: unavailable |
|    extrapolated-rows=disabled max-scan-range-rows=unavailable |
|    parquet statistics predicates: bool_col < TRUE |
|    parquet dictionary predicates: bool_col < TRUE |
|    mem-estimate=16.00MB mem-reservation=8.00KB thread-reservation=1 |
This comparison form is rarely used for bool columns, so we should handle the forms mentioned above as well.
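To illustrate what handling these forms could look like, here is a minimal sketch, not Impala's actual planner or scanner code. It assumes each of the six bool predicate forms can be normalized to "column equals a given value, optionally also accepting NULL", and that the file exposes min, max and null-count statistics per row group. The names BoolFilter, NORMALIZED and can_skip_row_group are hypothetical and exist only for this example.

```python
from typing import NamedTuple, Optional


class BoolFilter(NamedTuple):
    """Rows pass when the column equals `value`, or is NULL if `accepts_null`."""
    value: bool
    accepts_null: bool


# Hypothetical normalization of the predicate forms listed in this issue.
# In a WHERE clause a NULL result filters the row out, so "x" and "NOT x"
# never accept NULL, while the IS NOT forms do.
NORMALIZED = {
    "x": BoolFilter(True, False),
    "NOT x": BoolFilter(False, False),
    "x IS TRUE": BoolFilter(True, False),
    "x IS NOT TRUE": BoolFilter(False, True),
    "x IS FALSE": BoolFilter(False, False),
    "x IS NOT FALSE": BoolFilter(True, True),
}


def can_skip_row_group(f: BoolFilter,
                       min_val: Optional[bool],
                       max_val: Optional[bool],
                       null_count: Optional[int]) -> bool:
    """Return True only when the statistics prove that no row can pass."""
    # If NULL rows pass the filter, we must know that there are none.
    if f.accepts_null and (null_count is None or null_count > 0):
        return False
    # Missing min/max: stay conservative and scan the row group.
    if min_val is None or max_val is None:
        return False
    # Skip when the wanted value lies outside [min, max]; False < True.
    return not (min_val <= f.value <= max_val)
```

With this normalization, a row group whose statistics are min=FALSE, max=FALSE could be skipped for "x" and "x IS TRUE", while "x IS NOT TRUE" could skip a row group only when min=TRUE and the null count is known to be zero.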
Issue Links
- relates to: IMPALA-9040 Read performance improvements in ORC support (Open)