IMPALA-11924

Bloom filter size is unaffected by column NDV

Details

    Description

      For bloom filter sizing, Impala simply uses the cardinality of the build side, although this estimate could clearly be capped by the NDV (number of distinct values) of the build-side join key:
      https://github.com/apache/impala/blob/feb4a76ed4cb5b688143eb21370f78ec93133c56/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java#L661
      https://github.com/apache/impala/blob/feb4a76ed4cb5b688143eb21370f78ec93133c56/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java#L698

      E.g.:

      use tpch_parquet;
      set RUNTIME_FILTER_MIN_SIZE=8192;
      explain select count(*) from orders join customer on o_comment = c_mktsegment;
      PLAN-ROOT SINK
      |
      06:AGGREGATE [FINALIZE]
      |  output: count:merge(*)
      |  row-size=8B cardinality=1
      |
      05:EXCHANGE [UNPARTITIONED]
      |
      03:AGGREGATE
      |  output: count(*)
      |  row-size=8B cardinality=1
      |
      02:HASH JOIN [INNER JOIN, BROADCAST]
      |  hash predicates: o_comment = c_mktsegment
      |  runtime filters: RF000 <- c_mktsegment
      |  row-size=82B cardinality=162.03K
      |
      |--04:EXCHANGE [BROADCAST]
      |  |
      |  01:SCAN HDFS [tpch_parquet.customer]
      |     HDFS partitions=1/1 files=1 size=12.34MB
      |     row-size=21B cardinality=150.00K
      |
      00:SCAN HDFS [tpch_parquet.orders]
         HDFS partitions=1/1 files=2 size=54.21MB
         runtime filters: RF000 -> o_comment
         row-size=61B cardinality=1.50M
      

      The query above sets RF000's size to 65536 bytes, while the minimum of 8192 bytes would be more than enough, as the NDV of c_mktsegment is 5.

      The current logic should work well for FK/PK joins, where the build side's cardinality is close to the PK's NDV, but it can massively overestimate the filter size for large build-side tables whose join key has a small NDV.
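
The effect of capping can be sketched with a small calculation. This is a rough illustration in Python, not Impala's actual Java code: the function names are hypothetical, and the sizing formula (8 bits set per key), the target false positive probability of 0.75, and the 16 MB upper limit are assumptions chosen because they reproduce the 65536-byte size reported above.

```python
import math

# Assumption: each inserted key sets 8 bits in the filter (hypothetical constant name).
BITS_PER_KEY = 8


def min_log_space(ndv: int, fpp: float) -> int:
    """Return log2 of the smallest power-of-two byte size that holds
    `ndv` distinct keys at false positive probability `fpp`."""
    if ndv <= 0:
        return 0
    # Required number of bits: m = -k * ndv / ln(1 - fpp^(1/k))
    m_bits = -BITS_PER_KEY * ndv / math.log(1 - fpp ** (1.0 / BITS_PER_KEY))
    return max(0, math.ceil(math.log2(m_bits / 8)))


def filter_size_bytes(build_cardinality: int, key_ndv: int,
                      fpp: float = 0.75,           # assumed target error rate
                      min_size: int = 8192,        # RUNTIME_FILTER_MIN_SIZE from the example
                      max_size: int = 16 * 1024 * 1024,  # assumed upper limit
                      cap_by_ndv: bool = True) -> int:
    """Size a bloom filter from the build side estimate, optionally capped by NDV."""
    entries = build_cardinality
    if cap_by_ndv and key_ndv > 0:
        # The proposed change: never size for more keys than can be distinct.
        entries = min(build_cardinality, key_ndv)
    size = 1 << min_log_space(entries, fpp)
    # Clamp to the configured minimum and maximum filter sizes.
    return min(max(size, min_size), max_size)


# customer: cardinality=150.00K, NDV(c_mktsegment)=5
print(filter_size_bytes(150_000, 5, cap_by_ndv=False))  # sized by cardinality only
print(filter_size_bytes(150_000, 5))                    # sized by min(cardinality, NDV)
```

Under these assumptions, sizing by cardinality alone gives 65536 bytes, while capping by the NDV of 5 falls back to the 8192-byte minimum. For an FK/PK join the cap changes nothing, because the NDV of the key is close to the build side's cardinality.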

            People

              csringhofer Csaba Ringhofer