Normally, increasing the number of bit vectors improves calculation accuracy. For example,
select compute_stats(a, 40) from test_hive;
generally gets better accuracy than
select compute_stats(a, 16) from test_hive;
But a larger number of bit vectors also makes the query run slower. Once the number of bit vectors exceeds about 50, increasing it further no longer improves accuracy, but it still increases memory usage and can crash Hive if the number is huge. Hive currently does not prevent users from passing a ridiculously large number of bit vectors to 'compute_stats':
select compute_stats(a, 999999999) from column_eight_types;
2012-12-20 23:21:52,247 Stage-1 map = 0%, reduce = 0%
2012-12-20 23:22:11,315 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.29 sec
MapReduce Total cumulative CPU time: 290 msec
Ended Job = job_1354923204155_0777 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://cs-10-20-81-171.cloud.cloudera.com:8088/proxy/application_1354923204155_0777/
Examining task ID: task_1354923204155_0777_m_000000 (and more) from job job_1354923204155_0777
Task with the most failures(4):
Diagnostic Messages for this Task:
Error: Java heap space
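One way to avoid this failure mode is to reject out-of-range values during parameter validation, before any bit vectors are allocated. The sketch below is a hypothetical illustration in Java; the class name, method name, and the cap `NUM_BIT_VECTORS_MAX` are assumptions, not existing Hive identifiers:

```java
// Hypothetical bounds check for the bit-vector count passed to compute_stats.
// NUM_BIT_VECTORS_MAX is an assumed cap, not a constant defined by Hive.
public class BitVectorGuard {
    static final int NUM_BIT_VECTORS_MAX = 1024;

    // Fails fast with a clear message instead of letting the query
    // exhaust the heap at execution time.
    static int validateNumBitVectors(int requested) {
        if (requested < 1 || requested > NUM_BIT_VECTORS_MAX) {
            throw new IllegalArgumentException(
                "Number of bit vectors must be between 1 and "
                + NUM_BIT_VECTORS_MAX + ", got " + requested);
        }
        return requested;
    }

    public static void main(String[] args) {
        // A reasonable value passes through unchanged.
        System.out.println(validateNumBitVectors(40));
        // A ridiculously large value is rejected up front.
        try {
            validateNumBitVectors(999999999);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a check like this, the example query above would fail immediately with a descriptive error rather than dying mid-job with `Error: Java heap space`.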