I understand this better now. I did some poking around with the HFile tool. Average key length does seem to be around 150 bytes, as I estimated.
For one hfile /hbase/foo/fb820ae7002fc96f78165802a0b05e63/metrics/14129209576094096, metadata is:
avgKeyLen=159, avgValueLen=7, entries=49285512, length=615516343
fileinfoOffset=592314718, dataIndexOffset=592315104, dataIndexCount=131869, metaIndexOffset=0, metaIndexCount=0, totalBytes=8653853680, entryCount=49285512, version=1
Size of index = length - dataIndexOffset = 615516343 - 592315104 = 23201239 bytes (~22 MB)
Index data per region server = 22 MB * 180 regions = almost 4 GB. Add the other column family, and this does seem to account for the 5-6 GB of heap we are seeing.
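The arithmetic above can be double-checked with a quick sketch (all figures are taken from the HFile tool output for this one hfile; the 180-region count is our cluster's):

```python
# Back-of-envelope check of index heap usage for one hfile.
length = 615516343            # total hfile length in bytes (from metadata)
data_index_offset = 592315104 # dataIndexOffset (from metadata)
regions = 180                 # regions hosted per region server

index_bytes = length - data_index_offset
print(index_bytes / 1024 / 1024)   # index size per hfile, ~22 MB

heap_bytes = index_bytes * regions # assuming one comparable hfile per region
print(heap_bytes / 1024 ** 3)      # ~3.9 GB, before the second column family
```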
Number of entries per data-index entry = 49285512 / 131869 = ~374
Times the average key length (159 bytes for this file) = ~59 KB, which is close to the block size of 64 KB. So the numbers make sense.
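The same per-block estimate, spelled out (entries and avgKeyLen are from the file metadata above):

```python
entries = 49285512        # entryCount from metadata
data_index_count = 131869 # dataIndexCount from metadata
avg_key_len = 159         # avgKeyLen from metadata

# one data-index entry per block, so this is roughly entries per block
entries_per_block = round(entries / data_index_count)
print(entries_per_block)  # ~374

# keys alone per block -- close to, but under, the 64 KB block size
key_bytes_per_block = entries_per_block * avg_key_len
print(key_bytes_per_block)  # ~59 KB
```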
I also looked at the key/value pairs using the HFile tool (a section of output is below).
We have a few billion rows (2 - 4 billion). I haven't done a full row count.
What I didn't understand previously is that it's not 374 rows, but 374 "entries". An entry is a single column value, and the key is repeated for every one of them. Given our fairly large keys, that adds up quickly.
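To make the key-repetition cost concrete, a rough estimate using the averages from the metadata above (the 10-column count is hypothetical, purely for illustration):

```python
avg_key_len = 159     # avgKeyLen from the file metadata
avg_value_len = 7     # avgValueLen from the file metadata
columns_per_row = 10  # hypothetical column count, for illustration only

# each column value is a separate entry, and each entry carries the full key
row_bytes = columns_per_row * (avg_key_len + avg_value_len)
key_fraction = columns_per_row * avg_key_len / row_bytes
print(key_fraction)   # ~0.96: most stored bytes are repeated key material
```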
Options going forward:
1) Increase the HBase block size (I did this and it resolved our situation for now).
2) Modify our schema to use smaller keys - perhaps IDs instead of string names.
3) Modify our schema to have fewer columns - we could combine several related columns into one compound value.
4) Add an LRU cache for storefile indexes.
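As a rough sketch of option 3: several related numeric columns can be packed into one compound value, so the (large) key is stored once instead of once per column. The column names and values here are illustrative, not our actual schema:

```python
import struct

# instead of three separate entries (e.g. min, max, sum columns),
# pack all three into a single fixed-width value under one key
packed = struct.pack(">ddd", 1.5, 9.75, 33.0)
print(len(packed))  # 24 bytes, one entry, one copy of the key

# reading it back splits the compound value into its parts
mn, mx, sm = struct.unpack(">ddd", packed)
```

With a 159-byte average key, collapsing three columns into one saves roughly two extra copies of the key per row, at the cost of rewriting the whole compound value on any update.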
Given the other options, #4 may not be warranted, so I think we can close this issue.