IMPALA-6491

More robust HBase scan cardinality estimation


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0
    • Fix Version/s: None
    • Component/s: Frontend

    Description

      There are a few issues with our HBase scan cardinality estimation:
      1. The cardinality estimates can be very inaccurate, leading to bad plan choices. In particular, users have reported cases of severe underestimation, which can have a ripple effect through the query plan (e.g. the planner thinks a join with that table is selective).
      2. Unlike HDFS scans, we do not use row count statistics from the Hive Metastore for estimating the cardinality of HBase scans. Instead, we do a small scan over the HBase table and estimate a row count based on the average bytes per row and the storefile size.
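
The sampling-based method in item 2 can be sketched roughly as follows. This is a minimal illustration, not Impala's actual frontend code: the function name `estimate_rows_from_sample`, its parameters, and the `-1` "unknown" sentinel are all assumptions made for the example.

```python
from typing import Sequence


def estimate_rows_from_sample(sampled_row_bytes: Sequence[int],
                              storefile_bytes: int) -> int:
    """Estimate a row count from a small sample scan (hypothetical helper).

    sampled_row_bytes: sizes in bytes of the rows seen in the sample scan.
    storefile_bytes:   total storefile size of the regions being scanned.
    Returns -1 (assumed "unknown" sentinel) if no usable sample exists.
    """
    if not sampled_row_bytes:
        return -1
    avg_row_bytes = sum(sampled_row_bytes) / len(sampled_row_bytes)
    if avg_row_bytes <= 0:
        return -1
    # Rows are assumed to be uniformly sized, so total size / average size
    # gives the estimated row count.
    return int(storefile_bytes / avg_row_bytes)


# A sample of typical 100-byte rows over a 1 MB storefile.
typical = estimate_rows_from_sample([100, 100, 100], 1_000_000)
# The same storefile, but the sample happened to hit 1000-byte rows:
# the estimate drops by a factor of ten.
skewed = estimate_rows_from_sample([1000, 1000], 1_000_000)
```

Because the estimate depends entirely on the average row size in a small sample, an unrepresentative sample of large rows directly produces the kind of underestimation described in item 1.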

      There are other, more detailed caveats with the HBase estimation method.

      The original motivation of this method was to adjust the row count for queries that only scan a subset of the region servers (the HMS statistics only cover the entire table).

      Proposal
      To address these shortcomings, we could start with the table-level row count stored in the Metastore and then adjust that number based on the total number of bytes in the table and the number of bytes in the relevant region servers.
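
One possible reading of this proposal is to scale the Metastore row count by the fraction of the table's bytes that live in the relevant region servers. The sketch below assumes that proportional interpretation; the function name `scale_row_count`, its parameters, and the `-1` "no statistics" sentinel are hypothetical and not taken from Impala.

```python
def scale_row_count(table_row_count: int,
                    total_table_bytes: int,
                    relevant_bytes: int) -> int:
    """Scale a table-level row count to the scanned portion (hypothetical).

    table_row_count:   row count from the Metastore, or -1 if missing.
    total_table_bytes: total number of bytes in the HBase table.
    relevant_bytes:    bytes in the region servers the scan will touch.
    Returns -1 when the inputs do not allow an estimate.
    """
    if table_row_count < 0 or total_table_bytes <= 0:
        return -1
    # Clamp the fraction to [0, 1] in case the byte counts are stale
    # relative to each other.
    fraction = min(1.0, max(0.0, relevant_bytes / total_table_bytes))
    return round(table_row_count * fraction)


# A scan touching a quarter of the table's bytes gets a quarter of the rows.
partial = scale_row_count(1_000_000, 4_000, 1_000)
# A full-table scan keeps the Metastore row count unchanged.
full = scale_row_count(1_000_000, 4_000, 4_000)
```

Under this reading, a full-table scan returns exactly the Metastore row count, and the byte sizes only affect the estimate when a subset of region servers is scanned, which matches the original motivation for the byte-based method.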


          People

            Assignee: Unassigned
            Reporter: Alexander Behm (alex.behm)
            Votes: 0
            Watchers: 1
