Uploaded image for project: 'Kylin'
  1. Kylin
  2. KYLIN-2165

Use hive table statistics data to get the total count

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v2.0.0
    • Component/s: Job Engine
    • Labels:
      None

      Description

      Kylin will count on the intermediate flat hive table to get the total row number, then to redistribute that.

      From hive's wiki, hive will automatically collect the table statistics when run a "insert overwrite" statement, then the subsequent "select count" will be very fast (see https://cwiki.apache.org/confluence/display/Hive/StatsDev). While, Kylin is executing "INSERT OVERWRITE DIRECTORY '/kylin/row_count' SELECT count from", which still cause MR/Tez job be started, this will cause the step take longer time.

      Just change the SQL to "select count" or using Hive API to get the statistic, the cost will be saved.

      Here is a sample, the table 'kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5' is an intermediate table :

      If directly run "count", it is pretty fast:
      hive> select count from kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
      OK
      970033
      Time taken: 0.112 seconds, Fetched: 1 row(s)

      While today Kylin's SQL will cause a job be started:

      hive> INSERT OVERWRITE DIRECTORY '/kylin/row_count' select count from kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
      Query ID = root_20161106080808_0099b622-c0bd-41da-aee5-2321adf7bdda
      Total jobs = 1
      Launching Job 1 out of 1
      Number of reduce tasks determined at compile time: 1
      In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
      In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
      In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
      Starting Job = job_1463701915919_46208, Tracking URL =

        Attachments

        1. KYLIN-2165.BUGFIX.patch
          3 kB
          Dayue Gao

          Issue Links

            Activity

              People

              • Assignee:
                shaofengshi Shao Feng Shi
                Reporter:
                shaofengshi Shao Feng Shi
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: