Details
-
Improvement
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
Description
Kylin will count on the intermediate flat hive table to get the total row number, then to redistribute that.
From hive's wiki, hive will automatically collect the table statistics when run a "insert overwrite" statement, then the subsequent "select count" will be very fast (see https://cwiki.apache.org/confluence/display/Hive/StatsDev). While, Kylin is executing "INSERT OVERWRITE DIRECTORY '/kylin/row_count' SELECT count from", which still cause MR/Tez job be started, this will cause the step take longer time.
Just change the SQL to "select count" or using Hive API to get the statistic, the cost will be saved.
Here is a sample, the table 'kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5' is an intermediate table :
If directly run "count", it is pretty fast:
hive> select count from kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
OK
970033
Time taken: 0.112 seconds, Fetched: 1 row(s)
While today Kylin's SQL will cause a job be started:
hive> INSERT OVERWRITE DIRECTORY '/kylin/row_count' select count from kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
Query ID = root_20161106080808_0099b622-c0bd-41da-aee5-2321adf7bdda
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1463701915919_46208, Tracking URL =
Attachments
Attachments
Issue Links
- relates to
-
KYLIN-2159 Redistribution Hive Table Step always requires row_count filename as 000000_0
- Closed