Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-26236

count(1) with subquery count(distinct) gives wrong results with hive.optimize.distinct.rewrite=true and cbo on

Log workAgile BoardRank to TopRank to BottomBulk Copy AttachmentsBulk Move AttachmentsAdd voteVotersWatch issueWatchersCreate sub-taskConvert to sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • All Versions
    • None
    • CBO, Logical Optimizer

    Description

      it give wrong result when hive.optimize.distinct.rewrite is true, By default, it's true for all 3.x versions. The test result is 2, and the expected result is 1.

      create table count_distinct(a int, b int);
      insert into table count_distinct values (1,2),(2,3);
      set hive.execution.engine=tez;
      set hive.cbo.enable=true;
      set hive.optimize.distinct.rewrite=true;
      select count(1) from ( 
            select count(distinct a) from count_distinct
      ) tmp; 

      Before CBO optimization,RelNode tree as this,

      HiveProject(_o__c0=[$0])
        HiveAggregate(group=[{}], agg#0=[count($0)])
          HiveProject($f0=[1])
            HiveProject(_o__c0=[$0])
              HiveAggregate(group=[{}], agg#0=[count(DISTINCT $0)])
                HiveProject($f0=[$0])
                  HiveTableScan(table=[[default.count_distinct]], table:alias=[count_distinct]) 

      Optimized by HiveExpandDistinctAggregatesRule, RelNode tree as this,

      HiveProject(_o__c0=[$0])
        HiveAggregate(group=[{}], agg#0=[count($0)])
          HiveProject($f0=[1])
            HiveProject(_o__c0=[$0])
              HiveAggregate(group=[{}], agg#0=[count($0)])
                HiveAggregate(group=[{0}])
                  HiveProject($f0=[$0])
                    HiveProject($f0=[$0])
                      HiveTableScan(table=[[default.count_distinct]], table:alias=[count_distinct]) 

      count(distinct xx) converte to count (xx) from (select xx from table_name group by xx) 

      Optimized by Projection Pruning, RelNode tree as this, 

      HiveAggregate(group=[{}], agg#0=[count()])
        HiveProject(DUMMY=[0])
          HiveAggregate(group=[{}])
            HiveAggregate(group=[{0}])
              HiveProject(a=[$0])
                HiveTableScan(table=[[default.count_distinct]], table:alias=[count_distinct]) 

      In this case, an error occurs in the execution plan.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            honghui honghui.Liu Assign to me
            honghui honghui.Liu

            Dates

              Created:
              Updated:

              Slack

                Issue deployment