Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-8517

When joining on partition column NDV gets overridden by StatsUtils.getColStatisticsFromExpression

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 0.14.0
    • 0.14.0
    • Physical Optimizer
    • None
    • hive

    Description

      When joining on partition column number of partitions is used as NDV which gets overridden by StatsUtils.getColStatisticsFromExpression and the number of partitions used as NDV is replaced by number of rows which results in the same behavior as explained in https://issues.apache.org/jira/browse/HIVE-8196. "Joining on partition columns with fetch column stats enabled results it very small CE which negatively affects query performance "

      This is the call stack.

      StatsUtils.getColStatisticsFromExpression(HiveConf, Statistics, ExprNodeDesc) line: 1001	
      StatsRulesProcFactory$ReduceSinkStatsRule.process(Node, Stack<Node>, NodeProcessorCtx, Object...) line: 1479	
      DefaultRuleDispatcher.dispatch(Node, Stack<Node>, Object...) line: 90	
      PreOrderWalker(DefaultGraphWalker).dispatchAndReturn(Node, Stack<Node>) line: 94	
      PreOrderWalker(DefaultGraphWalker).dispatch(Node, Stack<Node>) line: 78	
      PreOrderWalker.walk(Node) line: 54	
      PreOrderWalker.walk(Node) line: 59	
      PreOrderWalker.walk(Node) line: 59	
      PreOrderWalker(DefaultGraphWalker).startWalking(Collection<Node>, HashMap<Node,Object>) line: 109	
      AnnotateWithStatistics.transform(ParseContext) line: 78	
      TezCompiler.runStatsAnnotation(OptimizeTezProcContext) line: 248	
      TezCompiler.optimizeOperatorPlan(ParseContext, Set<ReadEntity>, Set<WriteEntity>) line: 120	
      TezCompiler(TaskCompiler).compile(ParseContext, List<Task<Serializable>>, HashSet<ReadEntity>, HashSet<WriteEntity>) line: 99	
      SemanticAnalyzer.analyzeInternal(ASTNode) line: 10037	
      SemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line: 221	
      ExplainSemanticAnalyzer.analyzeInternal(ASTNode) line: 74	
      ExplainSemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line: 221	
      Driver.compile(String, boolean) line: 415	
      

      Query

      select
        ss_item_sk item_sk, d_date, sum(ss_sales_price),
        sum(sum(ss_sales_price))
            over (partition by ss_item_sk order by d_date rows between unbounded preceding and current row) cume_sales
      from store_sales
          ,date_dim
      where ss_sold_date_sk=d_date_sk
        and d_month_seq between 1193 and 1193+11
        and ss_item_sk is not NULL
      group by ss_item_sk, d_date
      

      Plan
      Notice in the Map join operator the number of rows drop from 82,510,879,939 to 36524 after the join.

      OK
      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 depends on stages: Stage-1
      
      STAGE PLANS:
        Stage: Stage-1
          Tez
            Edges:
              Map 1 <- Map 4 (BROADCAST_EDGE)
              Reducer 2 <- Map 1 (SIMPLE_EDGE)
              Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
            DagName: mmokhtar_20141019131818_086d663a-5621-456c-bf25-8ccb7112ee3b:6
            Vertices:
              Map 1
                  Map Operator Tree:
                      TableScan
                        alias: store_sales
                        filterExpr: ss_item_sk is not null (type: boolean)
                        Statistics: Num rows: 82510879939 Data size: 6873789738208 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: ss_item_sk is not null (type: boolean)
                          Statistics: Num rows: 82510879939 Data size: 652315818272 Basic stats: COMPLETE Column stats: COMPLETE
                          Map Join Operator
                            condition map:
                                 Inner Join 0 to 1
                            condition expressions:
                              0 {ss_item_sk} {ss_sales_price} {ss_sold_date_sk}
                              1 {d_date_sk} {d_date} {d_month_seq}
                            keys:
                              0 ss_sold_date_sk (type: int)
                              1 d_date_sk (type: int)
                            outputColumnNames: _col1, _col12, _col22, _col26, _col28, _col29
                            input vertices:
                              1 Map 4
                            Statistics: Num rows: 36524 Data size: 4163736 Basic stats: COMPLETE Column stats: COMPLETE
                            Filter Operator
                              predicate: (((_col22 = _col26) and _col29 BETWEEN 1193 AND 1204) and _col1 is not null) (type: boolean)
                              Statistics: Num rows: 9131 Data size: 1040934 Basic stats: COMPLETE Column stats: COMPLETE
                              Select Operator
                                expressions: _col1 (type: int), _col28 (type: string), _col12 (type: float)
                                outputColumnNames: _col1, _col28, _col12
                                Statistics: Num rows: 9131 Data size: 1040934 Basic stats: COMPLETE Column stats: COMPLETE
                                Group By Operator
                                  aggregations: sum(_col12)
                                  keys: _col1 (type: int), _col28 (type: string)
                                  mode: hash
                                  outputColumnNames: _col0, _col1, _col2
                                  Statistics: Num rows: 4565 Data size: 483890 Basic stats: COMPLETE Column stats: COMPLETE
                                  Reduce Output Operator
                                    key expressions: _col0 (type: int), _col1 (type: string)
                                    sort order: ++
                                    Map-reduce partition columns: _col0 (type: int), _col1 (type: string)
                                    Statistics: Num rows: 4565 Data size: 483890 Basic stats: COMPLETE Column stats: COMPLETE
                                    value expressions: _col2 (type: double)
                  Execution mode: vectorized
              Map 4
                  Map Operator Tree:
                      TableScan
                        alias: date_dim
                        filterExpr: (d_date_sk is not null and d_month_seq BETWEEN 1193 AND 1204) (type: boolean)
                        Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: (d_date_sk is not null and d_month_seq BETWEEN 1193 AND 1204) (type: boolean)
                          Statistics: Num rows: 36524 Data size: 3725448 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            key expressions: d_date_sk (type: int)
                            sort order: +
                            Map-reduce partition columns: d_date_sk (type: int)
                            Statistics: Num rows: 36524 Data size: 3725448 Basic stats: COMPLETE Column stats: COMPLETE
                            value expressions: d_date (type: string), d_month_seq (type: int)
                          Select Operator
                            expressions: d_date_sk (type: int)
                            outputColumnNames: _col0
                            Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
                            Group By Operator
                              keys: _col0 (type: int)
                              mode: hash
                              outputColumnNames: _col0
                              Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
                              Dynamic Partitioning Event Operator
                                Target Input: store_sales
                                Partition key expr: ss_sold_date_sk
                                Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
                                Target column: ss_sold_date_sk
                                Target Vertex: Map 1
                  Execution mode: vectorized
              Reducer 2
                  Reduce Operator Tree:
                    Group By Operator
                      aggregations: sum(VALUE._col0)
                      keys: KEY._col0 (type: int), KEY._col1 (type: string)
                      mode: mergepartial
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int), _col1 (type: string)
                        sort order: ++
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col0 (type: int), _col1 (type: string), _col2 (type: double)
                  Execution mode: vectorized
              Reducer 3
                  Reduce Operator Tree:
                    Extract
                      Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
                      PTF Operator
                        Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
                        Select Operator
                          expressions: _col0 (type: int), _col1 (type: string), _col2 (type: double), _wcol0 (type: double)
                          outputColumnNames: _col0, _col1, _col2, _col3
                          Statistics: Num rows: 4565 Data size: 36520 Basic stats: COMPLETE Column stats: COMPLETE
                          File Output Operator
                            compressed: false
                            Statistics: Num rows: 4565 Data size: 36520 Basic stats: COMPLETE Column stats: COMPLETE
                            table:
                                input format: org.apache.hadoop.mapred.TextInputFormat
                                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
            Processor Tree:
              ListSink
      

      Attachments

        1. HIVE-8517.1.patch
          0.7 kB
          Mostafa Mokhtar
        2. HIVE-8517.2.patch
          0.8 kB
          Mostafa Mokhtar
        3. HIVE-8517.3.patch
          3 kB
          Mostafa Mokhtar

        Issue Links

          Activity

            People

              mmokhtar Mostafa Mokhtar
              mmokhtar Mostafa Mokhtar
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: