There are a few issues with our HBase scan cardinality estimation:
1. The cardinality estimates can be very inaccurate, leading to bad plan choices. In particular, users have reported cases of severe underestimation, which can have a ripple effect through the query plan (e.g., the planner concludes that a join with that table is selective).
2. Unlike HDFS scans, we do not use row count statistics from the Hive Metastore for estimating the cardinality of HBase scans. Instead, we do a small scan over the HBase table and estimate a row count based on the average bytes per row and the storefile size.
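The current estimation logic described above can be sketched as follows. This is an illustrative Python sketch, not actual Impala code; all names are hypothetical.

```python
def estimate_hbase_row_count(sampled_bytes, sampled_rows, storefile_bytes):
    """Sketch of the current approach: derive the average bytes per row
    from a small sample scan over the HBase table, then extrapolate a
    row count from the total storefile size.

    All parameter and function names here are hypothetical."""
    if sampled_rows == 0:
        # Nothing sampled; no basis for an estimate.
        return 0
    avg_bytes_per_row = sampled_bytes / sampled_rows
    return int(storefile_bytes / avg_bytes_per_row)
```

Because the sample is small, the derived average bytes per row can deviate substantially from the true average, which is one source of the inaccuracy noted in issue 1.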
The HBase-based estimation method has other, more subtle caveats as well.
The original motivation of this method was to adjust the row count for queries that only scan a subset of the region servers (the HMS statistics only cover the entire table).
To address these shortcomings, we could start with the table-level row count stored in the Metastore and then adjust that number based on the total number of bytes in the table and the number of bytes in the relevant region servers.
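The proposed adjustment amounts to scaling the Metastore row count by the fraction of the table's bytes held in the regions the scan touches. A minimal sketch, with hypothetical names (not actual Impala identifiers):

```python
def estimate_rows_from_hms(hms_row_count, table_bytes, relevant_region_bytes):
    """Sketch of the proposed approach: start from the table-level row
    count in the Hive Metastore and scale it by the fraction of the
    table's bytes that live in the regions relevant to the scan.

    All parameter and function names here are hypothetical."""
    if table_bytes == 0:
        # No size information; fall back to the unscaled HMS count.
        return hms_row_count
    return int(hms_row_count * (relevant_region_bytes / table_bytes))
```

This preserves the original motivation (adjusting for scans that touch only a subset of region servers) while anchoring the estimate to the HMS row count instead of an extrapolated sample.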