Currently CreateHTableJob.estimateCuboidStorageSize does not take HBase encoding and compression into consideration. From our observation on real cubes, the estimated size can be tens of times bigger than the actual size.
Here are some stats:
Cube 1 (w/o HLL, holistic distinct count): 1051G => 161G (estimated size => actual size)
Cube 2 (w/o HLL): 2118G => 504G
Cube 3 (w/o HLL): 3507G => 791G
Cube 4 (w/ 2 HLL15): 188T => 2T
Cube 5 (w/ 2 HLL15): 28T => 0.7T
Cube 6 (w/ 1 HLL16): 172G => 30G
From the stats we can see that for cubes without HLL the estimation can be 4~5 times bigger than actual, while for cubes with HLL it can be more than 50 times bigger (e.g. cube 4: 188T / 2T ≈ 94x). It's worth studying why cube 6 is over-estimated by only about 6 times; it may be related to the HLL precision level, or simply to the data.
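To see why the raw estimate inflates so much, here is a simplified sketch, assuming the estimate is roughly "estimated rows x estimated row width" with no allowance for HBase key encoding or block compression; the class, method names, and byte counts below are illustrative assumptions, not the actual Kylin code:

public class RawEstimateSketch {
    // Raw, uncompressed upper bound for one cuboid: rows * (rowkey bytes + measure bytes).
    static long estimateRawCuboidBytes(long estimatedRows, int rowkeyBytes, int measureBytes) {
        return estimatedRows * (long) (rowkeyBytes + measureBytes);
    }

    public static void main(String[] args) {
        // Illustrative numbers: 100M rows, 30-byte rowkey, ~24 bytes of plain numeric measures.
        long plain = estimateRawCuboidBytes(100_000_000L, 30, 24);
        // Same cuboid with two HLL(15) measures counted at their dense size (2 * 2^15 bytes per row).
        long withHll = estimateRawCuboidBytes(100_000_000L, 30, 24 + 2 * (1 << 15));
        System.out.println("plain measures: ~" + plain / (1L << 30) + " GB raw estimate");
        System.out.println("with 2 x HLL15: ~" + withHll / (1L << 40) + " TB raw estimate");
    }
}

If an HLL(15) measure is counted at something close to its dense register size (~32 KB per row), two such measures dominate the per-row width, which would be consistent with the 50x+ over-estimation seen for the HLL cubes above.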
To reduce region counts, we will apply an estimation discount as follows:
if (isMemoryHungry) {
    logger.info("Cube is memory hungry, storage size multiply 0.05");
    ret *= 0.05;
} else {
    logger.info("Cube is not memory hungry, storage size multiply 0.25");
    ret *= 0.25;
}

And let's see how it works.