Kylin counts the rows of the intermediate flat Hive table to get the total row number, and then uses that number to redistribute the table.
According to Hive's wiki, Hive automatically collects table statistics when an "insert overwrite" statement runs, so a subsequent "select count(*)" is answered from those statistics and returns very quickly (see https://cwiki.apache.org/confluence/display/Hive/StatsDev). However, Kylin executes "INSERT OVERWRITE DIRECTORY '/kylin/row_count' SELECT count(*) from", which still starts an MR/Tez job and makes this step take longer than necessary.
Simply changing the SQL to a plain "select count(*)", or using the Hive API to read the table statistics, would save this cost.
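The statistics route can be sketched as follows. When Hive collects statistics, they land in the table's parameter map under keys such as "numRows"; in a real deployment the map would come from HiveMetaStoreClient.getTable(db, tbl).getParameters(), but here the metastore call is stubbed with a literal map so the parsing logic is self-contained. Class and method names are illustrative, not Kylin's actual code.

```java
import java.util.HashMap;
import java.util.Map;

public class RowCountFromStats {

    // Hive stores the collected row count under the "numRows" table parameter.
    static final String NUM_ROWS = "numRows";

    /**
     * Returns the row count recorded in the table statistics, or -1 if the
     * statistic is absent or unparsable. A return of -1 signals that the
     * caller must fall back to running "select count(*)".
     */
    static long rowCountFromParams(Map<String, String> tableParams) {
        String raw = tableParams.get(NUM_ROWS);
        if (raw == null) {
            return -1L; // stats were never collected for this table
        }
        try {
            return Long.parseLong(raw);
        } catch (NumberFormatException e) {
            return -1L; // corrupt value; treat the same as missing
        }
    }

    public static void main(String[] args) {
        // Stubbed parameters of a hypothetical intermediate table.
        Map<String, String> params = new HashMap<>();
        params.put("numRows", "10000");
        params.put("totalSize", "2048000");

        System.out.println(rowCountFromParams(params));          // 10000
        System.out.println(rowCountFromParams(new HashMap<>())); // -1
    }
}
```

Because "insert overwrite" populates these statistics automatically, the happy path never touches MR/Tez at all; only a missing or stale statistic would require the counting query.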
Here is a sample; the table 'kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5' is an intermediate table.
Running "count(*)" directly is pretty fast:
hive> select count(*) from kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
Time taken: 0.112 seconds, Fetched: 1 row(s)
By contrast, the SQL Kylin runs today causes a job to be started:
hive> INSERT OVERWRITE DIRECTORY '/kylin/row_count' select count(*) from kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
Query ID = root_20161106080808_0099b622-c0bd-41da-aee5-2321adf7bdda
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1463701915919_46208, Tracking URL =