IMPALA-8024

HBase table cardinality estimates are wrong

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 3.1.0
    • Fix Version/s: None
    • Component/s: Frontend
    • Labels: None

    Description

      IMPALA-8021 added cardinality estimates to EXPLAIN plan output. Running some of our PlannerTest files revealed that our HBase cardinality estimates are very poor, even for our simple test tables. For example, for functional_hbase.alltypessmall:

      count(*) tells us that there are 100 rows:

      select count(*) from functional_hbase.alltypessmall
      +----------+
      | count(*) |
      +----------+
      | 100      |
      +----------+
      

      Table stats claim that there are only 60 rows:

      show table stats functional_hbase.alltypessmall;
      +-----------------+--------------+------------+------+
      | Region Location | Start RowKey | Est. #Rows | Size |
      +-----------------+--------------+------------+------+
      | localhost       |              | 10         | 0B   |
      | localhost       | 1            | 10         | 0B   |
      | localhost       | 3            | 10         | 0B   |
      | localhost       | 5            | 10         | 0B   |
      | localhost       | 7            | 10         | 0B   |
      | localhost       | 9            | 10         | 0B   |
      | Total           |              | 60         | 0B   |
      +-----------------+--------------+------------+------+
      

      The NDV stats imply that there must be at least 100 rows, since timestamp_col has 100 distinct values:

      show column stats functional_hbase.alltypessmall
      +-----------------+-----------+------------------+--------+----------+----------+
      | Column          | Type      | #Distinct Values | #Nulls | Max Size | Avg Size |
      +-----------------+-----------+------------------+--------+----------+----------+
      | id              | INT       | 99               | 0      | 4        | 4        |
      ...
      | timestamp_col   | TIMESTAMP | 100              | 0      | 16       | 16       |
      ...
      +-----------------+-----------+------------------+--------+----------+----------+
      

      The planner, where the estimate matters most, assumes there are only 50 rows:

      select *
      from functional.alltypesagg join functional_hbase.alltypessmall using (id, int_col)
      
      |--01:SCAN HBASE [functional_hbase.alltypessmall]
      |     row-size=89B cardinality=50
      

      We need a more reliable estimate.
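
The mismatch between the four figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, not Impala code: the function names are hypothetical, and it assumes that the largest column NDV can serve as a lower bound on the row count. NDV values are themselves estimates, so this is a sanity check and not a proposed fix.

```python
# Figures copied from the report for functional_hbase.alltypessmall.
actual_rows = 100        # select count(*)
table_stats_rows = 60    # "Total" row of show table stats
planner_rows = 50        # cardinality shown in the EXPLAIN plan
# Only the two columns shown in the report; the others are elided there.
column_ndv = {"id": 99, "timestamp_col": 100}


def ndv_lower_bound(ndv_by_column):
    """A table has at least as many rows as its most distinct column has values."""
    return max(ndv_by_column.values(), default=0)


def clamp_estimate(estimate, ndv_by_column):
    """Hypothetical correction: never report fewer rows than the NDV lower bound."""
    return max(estimate, ndv_lower_bound(ndv_by_column))


def relative_error(estimate, actual):
    """Absolute error of an estimate as a fraction of the true row count."""
    return abs(estimate - actual) / actual


if __name__ == "__main__":
    print("NDV lower bound:", ndv_lower_bound(column_ndv))
    print("table stats error:", relative_error(table_stats_rows, actual_rows))
    print("planner error:", relative_error(planner_rows, actual_rows))
    print("clamped planner estimate:", clamp_estimate(planner_rows, column_ndv))
```

With the numbers from this report, both the table stats total and the planner estimate fall below the NDV lower bound, which is the inconsistency described above.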

            People

              Assignee: Unassigned
              Reporter: Paul Rogers