This is a follow up for
Had an offline discussion with Sambavi - she pointed out a scenario where the
HIVE-3433 implementation will not scale. Assume that the user is performing
a cube on many columns, say 8 columns. Each input row would then generate 2^8 = 256 rows
for the hash table, which may kill the current group by implementation.
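
To make the blow-up concrete, here is a minimal Python sketch of the per-row expansion. It is not Hive code: `cube_expand` is a hypothetical helper, and using `None` to mark a rolled-up column is an assumption made for illustration only.

```python
from itertools import combinations

def cube_expand(row):
    """Return every cube grouping key for one input row.

    Hypothetical helper: None marks a column that has been rolled up.
    """
    n = len(row)
    keys = []
    for k in range(n + 1):
        # Choose which k column positions keep their value.
        for kept in combinations(range(n), k):
            kept_set = set(kept)
            keys.append(tuple(row[i] if i in kept_set else None for i in range(n)))
    return keys

row = ("a", "b", "c", "d", "e", "f", "g", "h")  # one row with 8 cube columns
print(len(cube_expand(row)))  # 256
```

Every input row pays this cost before any aggregation has reduced the data.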
A better implementation would be to add an additional MR job. In the first
MR job, perform the group by as if there were no cube. In the second MR job,
perform the cube on the output of the first. The assumption is that the group by would have
decreased the output data significantly, and that the rows would appear in the order of
the grouping keys, which gives a higher probability of hitting the hash table.
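
The two-job idea can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the Hive implementation: the aggregate is assumed to be a simple count (which can be re-aggregated safely), the function names are hypothetical, and in-memory `Counter` objects stand in for the two MR jobs.

```python
from collections import Counter
from itertools import combinations

def cube_keys(key):
    """Yield all cube grouping keys for one key (None marks a rolled-up column)."""
    n = len(key)
    for k in range(n + 1):
        for kept in combinations(range(n), k):
            kept_set = set(kept)
            yield tuple(key[i] if i in kept_set else None for i in range(n))

def single_stage_cube_count(rows):
    """Baseline: expand every raw row, as in the single-job approach."""
    cubed = Counter()
    for row in rows:
        for ck in cube_keys(row):
            cubed[ck] += 1
    return cubed

def two_stage_cube_count(rows):
    """Stage 1: plain group by. Stage 2: cube over the pre-aggregated groups."""
    # Stage 1 (stands in for the first MR job): group by all columns, no cube.
    grouped = Counter(rows)
    # Stage 2 (stands in for the second MR job): expand each distinct group once,
    # visiting groups in grouping-key order and carrying the partial count.
    cubed = Counter()
    for key in sorted(grouped):
        for ck in cube_keys(key):
            cubed[ck] += grouped[key]
    return grouped, cubed

rows = [("x", "1")] * 3 + [("y", "1")] * 2
grouped, cubed = two_stage_cube_count(rows)
assert cubed == single_stage_cube_count(rows)
print(len(rows), len(grouped))  # 5 2
```

In this sketch the expansion runs once per distinct group rather than once per input row, which is where the saving comes from when the first group by shrinks the data.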