Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-7261

Calculation works wrong when hive.groupby.skewindata is true and count(*) count(distinct) group by work simultaneously

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 0.12.0
    • None
    • Query Processor
    • None
    • hive0.12 hadoop1.0.4

    Description

      【Phenomenon】
      The query results are not the same as when hive.groupby.skewindata was setted to true and false.

      【my question】
      I want to calculate the count and count(distinct) simultaneously ,otherwise it will cost 2 MR job to calculate. But when i set the hive.groupby.skewindata to be true, the count result shoud not be same as the count(distinct) , but the real result is same, so it's wrong. And I find the difference of its query plan which the "Reduce Operator Tree->Group By Operator->mode" is mergepartial when skew is set to false and
      "Reduce Operator Tree->Group By Operator->mode" is complete when skew is set to true. So i'm confused the root cause of the error.

      【sql】
      select ds,appid,eventname,active,count(distinct(guid)), count from eventinfo_tmp where ds='20140612' and length(eventname)<1000 and eventname like '%alibaba%' group by ds,appid,eventname,active;

      【the others hive configaration exclude hive.groupby.skewindata】
      hive.exec.compress.output=true
      hive.exec.compress.intermediate=true
      io.seqfile.compression.type=BLOCK
      mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
      hive.map.aggr=true
      hive.stats.autogather=false
      hive.exec.scratchdir=/user/complat/tmp
      mapred.job.queue.name=complat
      hive.exec.mode.local.auto=false
      hive.exec.mode.local.auto.inputbytes.max=500
      hive.exec.mode.local.auto.tasks.max=10
      hive.exec.mode.local.auto.input.files.max=1000
      hive.exec.dynamic.partition=true
      hive.exec.dynamic.partition.mode=nonstrict
      hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
      mapred.max.split.size=100000000
      mapred.min.split.size.per.node=100000000
      mapred.min.split.size.per.rack=100000000

      【result】
      when hive.groupby.skewindata=true the result is :
      20140612 8 alibaba 1 87 147

      when it=false the result is :
      20140612 8 alibaba 1 87 87

      【query plan】
      ABSTRACT SYNTAX TREE:
      (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME eventinfo_tmp))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL ds)) (TOK_SELEXPR (TOK_TABLE_OR_COL appid)) (TOK_SELEXPR (TOK_TABLE_OR_COL eventname)) (TOK_SELEXPR (TOK_TABLE_OR_COL active)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL guid))) (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_WHERE (and (and (= (TOK_TABLE_OR_COL ds) '20140612') (< (TOK_FUNCTION length (TOK_TABLE_OR_COL eventname)) 1000)) (like (TOK_TABLE_OR_COL eventname) '%tvvideo_setting%'))) (TOK_GROUPBY (TOK_TABLE_OR_COL ds) (TOK_TABLE_OR_COL appid) (TOK_TABLE_OR_COL eventname) (TOK_TABLE_OR_COL active))))

      STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 is a root stage

      STAGE PLANS:
      Stage: Stage-1
      Map Reduce
      Alias -> Map Operator Tree:
      eventinfo_tmp
      TableScan
      alias: eventinfo_tmp
      Filter Operator
      predicate:
      expr: ((length(eventname) < 1000) and (eventname like '%tvvideo_setting%'))
      type: boolean
      Select Operator
      expressions:
      expr: ds
      type: string
      expr: appid
      type: string
      expr: eventname
      type: string
      expr: active
      type: int
      expr: guid
      type: string
      outputColumnNames: ds, appid, eventname, active, guid
      Group By Operator
      aggregations:
      expr: count(DISTINCT guid)
      expr: count()
      bucketGroup: false
      keys:
      expr: ds
      type: string
      expr: appid
      type: string
      expr: eventname
      type: string
      expr: active
      type: int
      expr: guid
      type: string
      mode: hash
      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
      Reduce Output Operator
      key expressions:
      expr: _col0
      type: string
      expr: _col1
      type: string
      expr: _col2
      type: string
      expr: _col3
      type: int
      expr: _col4
      type: string
      sort order: +++++
      Map-reduce partition columns:
      expr: _col0
      type: string
      expr: _col1
      type: string
      expr: _col2
      type: string
      expr: _col3
      type: int
      tag: -1
      value expressions:
      expr: _col5
      type: bigint
      expr: _col6
      type: bigint
      Reduce Operator Tree:
      Group By Operator
      aggregations:
      expr: count(DISTINCT KEY._col4:0._col0)
      expr: count(VALUE._col1)
      bucketGroup: false
      keys:
      expr: KEY._col0
      type: string
      expr: KEY._col1
      type: string
      expr: KEY._col2
      type: string
      expr: KEY._col3
      type: int
      mode: mergepartial
      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
      Select Operator
      expressions:
      expr: _col0
      type: string
      expr: _col1
      type: string
      expr: _col2
      type: string
      expr: _col3
      type: int
      expr: _col4
      type: bigint
      expr: _col5
      type: bigint
      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
      File Output Operator
      compressed: true
      GlobalTableId: 0
      table:
      input format: org.apache.hadoop.mapred.TextInputFormat
      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

      Stage: Stage-0
      Fetch Operator
      limit: -1

      ABSTRACT SYNTAX TREE:
      (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME eventinfo_tmp))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL ds)) (TOK_SELEXPR (TOK_TABLE_OR_COL appid)) (TOK_SELEXPR (TOK_TABLE_OR_COL eventname)) (TOK_SELEXPR (TOK_TABLE_OR_COL active)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL guid))) (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_WHERE (and (and (= (TOK_TABLE_OR_COL ds) '20140612') (< (TOK_FUNCTION length (TOK_TABLE_OR_COL eventname)) 1000)) (like (TOK_TABLE_OR_COL eventname) '%tvvideo_setting%'))) (TOK_GROUPBY (TOK_TABLE_OR_COL ds) (TOK_TABLE_OR_COL appid) (TOK_TABLE_OR_COL eventname) (TOK_TABLE_OR_COL active))))

      STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 is a root stage

      STAGE PLANS:
      Stage: Stage-1
      Map Reduce
      Alias -> Map Operator Tree:
      eventinfo_tmp
      TableScan
      alias: eventinfo_tmp
      Filter Operator
      predicate:
      expr: ((length(eventname) < 1000) and (eventname like '%tvvideo_setting%'))
      type: boolean
      Select Operator
      expressions:
      expr: ds
      type: string
      expr: appid
      type: string
      expr: eventname
      type: string
      expr: active
      type: int
      expr: guid
      type: string
      outputColumnNames: ds, appid, eventname, active, guid
      Group By Operator
      aggregations:
      expr: count(DISTINCT guid)
      expr: count()
      bucketGroup: false
      keys:
      expr: ds
      type: string
      expr: appid
      type: string
      expr: eventname
      type: string
      expr: active
      type: int
      expr: guid
      type: string
      mode: hash
      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
      Reduce Output Operator
      key expressions:
      expr: _col0
      type: string
      expr: _col1
      type: string
      expr: _col2
      type: string
      expr: _col3
      type: int
      expr: _col4
      type: string
      sort order: +++++
      Map-reduce partition columns:
      expr: _col0
      type: string
      expr: _col1
      type: string
      expr: _col2
      type: string
      expr: _col3
      type: int
      tag: -1
      value expressions:
      expr: _col5
      type: bigint
      expr: _col6
      type: bigint
      Reduce Operator Tree:
      Group By Operator
      aggregations:
      expr: count(DISTINCT KEY._col4:0._col0)
      expr: count(VALUE._col1)
      bucketGroup: false
      keys:
      expr: KEY._col0
      type: string
      expr: KEY._col1
      type: string
      expr: KEY._col2
      type: string
      expr: KEY._col3
      type: int
      mode: complete
      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
      Select Operator
      expressions:
      expr: _col0
      type: string
      expr: _col1
      type: string
      expr: _col2
      type: string
      expr: _col3
      type: int
      expr: _col4
      type: bigint
      expr: _col5
      type: bigint
      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
      File Output Operator
      compressed: true
      GlobalTableId: 0
      table:
      input format: org.apache.hadoop.mapred.TextInputFormat
      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

      Stage: Stage-0
      Fetch Operator
      limit: -1

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              chris_chen Chris Chen
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: