Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Not A Bug
- Component: Kudu_Impala
Description
I am running comparison tests between Impala on HDFS and Impala on Kudu.
We have a set of queries that access a number of fact tables and dimension tables.
One of the queries processes two fact tables with roughly 78 million and 668 million records.
With the data in Impala on HDFS, I was able to get query results in under 50 seconds.
With the data in Impala on Kudu, even after trying a number of distribution/partition schemes, I have not been able to get the execution time below 125 seconds.
So I have some concerns here:
1. In Kudu, what is the guideline for the number of cores/nodes in a cluster relative to the number of records to process?
2. In Kudu, is there anything like a distributed cache for Impala on Kudu to improve my execution time?
3. Is there any other way to improve performance with such a large data volume?
I have attached the query for reference.
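For reference, the distribution/partition experiments mentioned above would typically be expressed through Impala's Kudu DDL. The sketch below is a hypothetical example only (table name, columns, and partition count are made up, not taken from the attached query): it hash-partitions a large fact table on its primary key so scans and joins spread across tablet servers.

```sql
-- Hypothetical sketch: hash-partition a Kudu fact table from Impala.
-- All identifiers here are illustrative; tune PARTITIONS to a small
-- multiple of the number of tablet servers in the cluster.
CREATE TABLE sales_fact (
  sale_id BIGINT,
  sale_date STRING,
  customer_id BIGINT,
  amount DOUBLE,
  PRIMARY KEY (sale_id)
)
PARTITION BY HASH (sale_id) PARTITIONS 16
STORED AS KUDU;
```

Whether hash, range, or a combination of both works best depends on the join keys and predicates in the actual query.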