One option is to count the entries in each MapFile index and multiply that count by hbase.io.index.interval (or the INDEX_INTERVAL HTD attribute), then consider all of the MapFiles for the columns in a table and choose the largest value. Do this for all of the table's regions. The result would be a reasonable estimate, but the whole process sounds expensive. Originally I was thinking that the regionservers could do this, since they have to read in the MapFile indexes anyway and they also know the count of rows in memcache. But if regionservers limit the number of in-memory MapFile indexes to avoid OOME, as has been discussed, they won't have all of the information on hand.
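To make the arithmetic concrete, here is a rough sketch of that estimate. The class and method names are made up for illustration and are not HBase APIs, the interval of 32 used below is only an example value, and summing the per-region values into a table total is my assumption about how they would be combined.

```java
/** Hypothetical helper for illustration only; not part of HBase. */
final class RowCountEstimator {

    /** Rough row count for one MapFile: index entries times the index interval. */
    static long estimateMapFileRows(long indexEntries, int indexInterval) {
        return indexEntries * indexInterval;
    }

    /**
     * Estimate for one region: the largest per-MapFile estimate among the
     * MapFiles of its columns. indexEntryCounts holds one entry count per MapFile.
     */
    static long estimateRegionRows(long[] indexEntryCounts, int indexInterval) {
        long largest = 0;
        for (long entries : indexEntryCounts) {
            largest = Math.max(largest, estimateMapFileRows(entries, indexInterval));
        }
        return largest;
    }

    /** Table estimate: sum of the region estimates (an assumption, see above). */
    static long estimateTableRows(long[][] regions, int indexInterval) {
        long total = 0;
        for (long[] region : regions) {
            total += estimateRegionRows(region, indexInterval);
        }
        return total;
    }
}
```

For example, a region whose MapFile indexes hold 10 and 12 entries at an interval of 32 would be estimated at 12 * 32 = 384 rows. The expensive part is getting the entry counts in the first place, which this sketch simply takes as input.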
Maybe a map of MapFile to row count estimates could be stored in the FS next to the MapFiles and updated appropriately during compactions. A client could then iterate over the regions of a table and ask the regionservers involved for row count estimates. Each regionserver would consult the estimation map and send back the largest count found there for the table plus the largest memcache count for the table, and finally the client would total all of the results.
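A minimal in-memory sketch of that flow might look like the following. Everything here is hypothetical: the class names, the use of plain HashMaps in place of a map persisted in the FS, and the direct method calls in place of client-to-regionserver RPCs are all stand-ins. It also assumes a compaction replaces the entries for its input MapFiles with a single entry for the output.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch for illustration only; not part of HBase. */
final class EstimateMapSketch {

    /** Per-region state a regionserver would consult when asked for an estimate. */
    static final class Region {
        // MapFile name -> estimated rows; stands in for the map stored next to the MapFiles.
        private final Map<String, Long> mapFileEstimates = new HashMap<>();
        // Column name -> rows currently held in memcache.
        private final Map<String, Long> memcacheCounts = new HashMap<>();

        void recordMapFile(String mapFile, long estimatedRows) {
            mapFileEstimates.put(mapFile, estimatedRows);
        }

        void recordMemcache(String column, long rows) {
            memcacheCounts.put(column, rows);
        }

        /** On compaction, drop the input MapFiles and record the output's estimate. */
        void onCompaction(List<String> inputs, String output, long outputEstimate) {
            for (String input : inputs) {
                mapFileEstimates.remove(input);
            }
            mapFileEstimates.put(output, outputEstimate);
        }

        /** Largest stored MapFile estimate plus largest memcache count. */
        long estimate() {
            return largest(mapFileEstimates) + largest(memcacheCounts);
        }

        private static long largest(Map<String, Long> counts) {
            long max = 0;
            for (long value : counts.values()) {
                max = Math.max(max, value);
            }
            return max;
        }
    }

    /** Client side: total the estimates returned for each region of the table. */
    static long totalEstimate(List<Region> regions) {
        long total = 0;
        for (Region region : regions) {
            total += region.estimate();
        }
        return total;
    }
}
```

The point of the sketch is that the regionserver never needs the MapFile indexes in memory to answer: it only reads the small stored map, so the cost of counting is paid once, at compaction time.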