IMPALA-2416

Use Min, Max, Distinct count & row count to create a uniformly distributed histogram for better Cardinality estimation


Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.3.0
    • Fix Version/s: None
    • Component/s: Frontend

    Description

      As a stepping stone to using histograms for more accurate cardinality estimation, build a uniformly distributed histogram from the Min, Max, distinct count and row count of a column to better estimate joins and filters.
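One way to picture the proposed histogram is a set of equal-width buckets over [Min, Max], each holding an equal share of the rows and of the distinct values. The sketch below only illustrates that idea and is not Impala's implementation: `Bucket`, `build_uniform_histogram` and `num_buckets` are hypothetical names, and the bucket count is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Bucket:
    lo: float    # lower bound of the bucket
    hi: float    # upper bound of the bucket
    rows: float  # estimated number of rows falling in the bucket
    ndv: float   # estimated number of distinct values in the bucket


def build_uniform_histogram(min_val: float, max_val: float, ndv: float,
                            row_count: float, num_buckets: int = 4) -> List[Bucket]:
    """Split [min_val, max_val] into equal-width buckets.

    Assumes values are uniformly distributed, so every bucket receives
    the same share of the rows and of the distinct values.
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    if max_val < min_val:
        raise ValueError("max_val must not be smaller than min_val")

    span = max_val - min_val
    # num_buckets + 1 boundaries; neighbouring buckets share a boundary.
    bounds = [min_val + span * i / num_buckets for i in range(num_buckets + 1)]
    return [
        Bucket(lo=bounds[i], hi=bounds[i + 1],
               rows=row_count / num_buckets, ndv=ndv / num_buckets)
        for i in range(num_buckets)
    ]
```

Because every bucket is identical in width and weight, such a histogram carries no more information than the four input statistics; its value is that estimation code written against buckets could later be fed a real, non-uniform histogram.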

      For a table with the following table and column stats, this is what Impala estimates:

      +---------+--------+---------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------+
      | #Rows   | #Files | Size    | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                  |
      +---------+--------+---------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------+
      | 1500000 | 2      | 54.93MB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://localhost:20500/test-warehouse/tpch.orders_parquet |
      +---------+--------+---------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------+
      
      +-----------------+---------------+------------------+--------+----------+-------------------+
      | Column          | Type          | #Distinct Values | #Nulls | Max Size | Avg Size          |
      +-----------------+---------------+------------------+--------+----------+-------------------+
      | o_orderkey      | BIGINT        | 1563438          | -1     | 8        | 8                 |
      | o_custkey       | BIGINT        | 98390            | -1     | 8        | 8                 |
      | o_orderstatus   | STRING        | 3                | -1     | 1        | 1                 |
      | o_totalprice    | DECIMAL(12,2) | 1438190          | -1     | 8        | 8                 |
      | o_orderdate     | STRING        | 2468             | -1     | 10       | 10                |
      | o_orderpriority | STRING        | 5                | -1     | 15       | 8.399886131286621 |
      | o_clerk         | STRING        | 1006             | -1     | 15       | 15                |
      | o_shippriority  | INT           | 1                | -1     | 4        | 4                 |
      | o_comment       | STRING        | 1388613          | -1     | 78       | 48.51387023925781 |
      +-----------------+---------------+------------------+--------+----------+-------------------+
      
      
      
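      +-------------------------------------------+----------+---------+
      | Condition                                 | Estimate | Actual  |
      +-------------------------------------------+----------+---------+
      | o_orderkey in (1,2,3,4)                   | 4        | 4       |
      | o_orderkey between 1 and 4                | 15,000   | 4       |
      | o_orderkey <= 4 and o_orderkey >= 1       | 15,000   | 4       |
      | o_orderkey <= 1500000 and o_orderkey >= 1 | 15,000   | 375,000 |
      +-------------------------------------------+----------+---------+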
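To make the gap in the range conditions above concrete, here is a minimal sketch of what a uniform-distribution estimate could look like for an integer column. It is an illustration under stated assumptions, not Impala's planner code: the function names are hypothetical, and the Min and Max used in the usage note (1 and 6,000,000 for o_orderkey) are assumed values, since the stats listed above do not include them.

```python
def estimate_range_rows(lo: int, hi: int, min_val: int, max_val: int,
                        row_count: int) -> float:
    """Estimate rows matching lo <= col <= hi for an integer column.

    Assumes values are spread uniformly over [min_val, max_val], so the
    matching fraction is the overlap of the two ranges divided by the
    size of the column's value domain.
    """
    lo_clamped = max(lo, min_val)
    hi_clamped = min(hi, max_val)
    if hi_clamped < lo_clamped:
        return 0.0  # the predicate range lies entirely outside [min, max]
    domain_size = max_val - min_val + 1
    return row_count * (hi_clamped - lo_clamped + 1) / domain_size


def estimate_in_list_rows(num_values: int, ndv: int, row_count: int) -> float:
    """Estimate rows matching an IN list: num_values * rows per distinct value."""
    # The distinct count is itself an estimate and may exceed the row
    # count (as it does for o_orderkey above), so cap it.
    ndv = min(ndv, row_count)
    return min(float(row_count), num_values * row_count / ndv)
```

With 1,500,000 rows and the assumed Min and Max, `estimate_range_rows(1, 4, 1, 6000000, 1500000)` gives 1.0 and `estimate_range_rows(1, 1500000, 1, 6000000, 1500000)` gives 375000.0, both far closer to the actual counts than the flat 15,000. `estimate_in_list_rows(4, 1563438, 1500000)` gives 4.0, in line with the existing IN-list estimate.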

          People

            Assignee: Unassigned
            Reporter: Mostafa Mokhtar (mmokhtar)