[KYLIN-2165] Use hive table statistics data to get the total count - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: v2.0.0
Component/s: Job Engine
Labels:
None

Description

Kylin will count on the intermediate flat hive table to get the total row number, then to redistribute that.

From hive's wiki, hive will automatically collect the table statistics when run a "insert overwrite" statement, then the subsequent "select count" will be very fast (see https://cwiki.apache.org/confluence/display/Hive/StatsDev). While, Kylin is executing "INSERT OVERWRITE DIRECTORY '/kylin/row_count' SELECT count from", which still cause MR/Tez job be started, this will cause the step take longer time.

Just change the SQL to "select count" or using Hive API to get the statistic, the cost will be saved.

Here is a sample, the table 'kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5' is an intermediate table :

If directly run "count", it is pretty fast:
hive> select count from kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
OK
970033
Time taken: 0.112 seconds, Fetched: 1 row(s)

While today Kylin's SQL will cause a job be started:

hive> INSERT OVERWRITE DIRECTORY '/kylin/row_count' select count from kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
Query ID = root_20161106080808_0099b622-c0bd-41da-aee5-2321adf7bdda
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1463701915919_46208, Tracking URL =

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

KYLIN-2165.BUGFIX.patch
06/Apr/17 05:57
3 kB
Dayue Gao

Issue Links

relates to

KYLIN-2159 Redistribution Hive Table Step always requires row_count filename as 000000_0

Closed

Activity

People

Assignee:: Shao Feng Shi

Reporter:: Shao Feng Shi

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 06/Nov/16 08:11

Updated:: 30/Apr/17 12:21

Resolved:: 06/Apr/17 12:41