[HIVE-8517] When joining on partition column NDV gets overridden by StatsUtils.getColStatisticsFromExpression - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.14.0
Fix Version/s: 0.14.0
Component/s: Physical Optimizer
Labels:
None

Tags:
hive

Description

When joining on partition column number of partitions is used as NDV which gets overridden by StatsUtils.getColStatisticsFromExpression and the number of partitions used as NDV is replaced by number of rows which results in the same behavior as explained in https://issues.apache.org/jira/browse/HIVE-8196. "Joining on partition columns with fetch column stats enabled results it very small CE which negatively affects query performance "

This is the call stack.

StatsUtils.getColStatisticsFromExpression(HiveConf, Statistics, ExprNodeDesc) line: 1001	
StatsRulesProcFactory$ReduceSinkStatsRule.process(Node, Stack<Node>, NodeProcessorCtx, Object...) line: 1479	
DefaultRuleDispatcher.dispatch(Node, Stack<Node>, Object...) line: 90	
PreOrderWalker(DefaultGraphWalker).dispatchAndReturn(Node, Stack<Node>) line: 94	
PreOrderWalker(DefaultGraphWalker).dispatch(Node, Stack<Node>) line: 78	
PreOrderWalker.walk(Node) line: 54	
PreOrderWalker.walk(Node) line: 59	
PreOrderWalker.walk(Node) line: 59	
PreOrderWalker(DefaultGraphWalker).startWalking(Collection<Node>, HashMap<Node,Object>) line: 109	
AnnotateWithStatistics.transform(ParseContext) line: 78	
TezCompiler.runStatsAnnotation(OptimizeTezProcContext) line: 248	
TezCompiler.optimizeOperatorPlan(ParseContext, Set<ReadEntity>, Set<WriteEntity>) line: 120	
TezCompiler(TaskCompiler).compile(ParseContext, List<Task<Serializable>>, HashSet<ReadEntity>, HashSet<WriteEntity>) line: 99	
SemanticAnalyzer.analyzeInternal(ASTNode) line: 10037	
SemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line: 221	
ExplainSemanticAnalyzer.analyzeInternal(ASTNode) line: 74	
ExplainSemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line: 221	
Driver.compile(String, boolean) line: 415

Query

select
  ss_item_sk item_sk, d_date, sum(ss_sales_price),
  sum(sum(ss_sales_price))
      over (partition by ss_item_sk order by d_date rows between unbounded preceding and current row) cume_sales
from store_sales
    ,date_dim
where ss_sold_date_sk=d_date_sk
  and d_month_seq between 1193 and 1193+11
  and ss_item_sk is not NULL
group by ss_item_sk, d_date

Plan
Notice in the Map join operator the number of rows drop from 82,510,879,939 to 36524 after the join.

OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Map 1 <- Map 4 (BROADCAST_EDGE)
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
      DagName: mmokhtar_20141019131818_086d663a-5621-456c-bf25-8ccb7112ee3b:6
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: ss_item_sk is not null (type: boolean)
                  Statistics: Num rows: 82510879939 Data size: 6873789738208 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ss_item_sk is not null (type: boolean)
                    Statistics: Num rows: 82510879939 Data size: 652315818272 Basic stats: COMPLETE Column stats: COMPLETE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {ss_item_sk} {ss_sales_price} {ss_sold_date_sk}
                        1 {d_date_sk} {d_date} {d_month_seq}
                      keys:
                        0 ss_sold_date_sk (type: int)
                        1 d_date_sk (type: int)
                      outputColumnNames: _col1, _col12, _col22, _col26, _col28, _col29
                      input vertices:
                        1 Map 4
                      Statistics: Num rows: 36524 Data size: 4163736 Basic stats: COMPLETE Column stats: COMPLETE
                      Filter Operator
                        predicate: (((_col22 = _col26) and _col29 BETWEEN 1193 AND 1204) and _col1 is not null) (type: boolean)
                        Statistics: Num rows: 9131 Data size: 1040934 Basic stats: COMPLETE Column stats: COMPLETE
                        Select Operator
                          expressions: _col1 (type: int), _col28 (type: string), _col12 (type: float)
                          outputColumnNames: _col1, _col28, _col12
                          Statistics: Num rows: 9131 Data size: 1040934 Basic stats: COMPLETE Column stats: COMPLETE
                          Group By Operator
                            aggregations: sum(_col12)
                            keys: _col1 (type: int), _col28 (type: string)
                            mode: hash
                            outputColumnNames: _col0, _col1, _col2
                            Statistics: Num rows: 4565 Data size: 483890 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col0 (type: int), _col1 (type: string)
                              sort order: ++
                              Map-reduce partition columns: _col0 (type: int), _col1 (type: string)
                              Statistics: Num rows: 4565 Data size: 483890 Basic stats: COMPLETE Column stats: COMPLETE
                              value expressions: _col2 (type: double)
            Execution mode: vectorized
        Map 4
            Map Operator Tree:
                TableScan
                  alias: date_dim
                  filterExpr: (d_date_sk is not null and d_month_seq BETWEEN 1193 AND 1204) (type: boolean)
                  Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (d_date_sk is not null and d_month_seq BETWEEN 1193 AND 1204) (type: boolean)
                    Statistics: Num rows: 36524 Data size: 3725448 Basic stats: COMPLETE Column stats: COMPLETE
                    Reduce Output Operator
                      key expressions: d_date_sk (type: int)
                      sort order: +
                      Map-reduce partition columns: d_date_sk (type: int)
                      Statistics: Num rows: 36524 Data size: 3725448 Basic stats: COMPLETE Column stats: COMPLETE
                      value expressions: d_date (type: string), d_month_seq (type: int)
                    Select Operator
                      expressions: d_date_sk (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
                      Group By Operator
                        keys: _col0 (type: int)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
                        Dynamic Partitioning Event Operator
                          Target Input: store_sales
                          Partition key expr: ss_sold_date_sk
                          Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
                          Target column: ss_sold_date_sk
                          Target Vertex: Map 1
            Execution mode: vectorized
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: sum(VALUE._col0)
                keys: KEY._col0 (type: int), KEY._col1 (type: string)
                mode: mergepartial
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  key expressions: _col0 (type: int), _col1 (type: string)
                  sort order: ++
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
                  value expressions: _col0 (type: int), _col1 (type: string), _col2 (type: double)
            Execution mode: vectorized
        Reducer 3
            Reduce Operator Tree:
              Extract
                Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
                PTF Operator
                  Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
                  Select Operator
                    expressions: _col0 (type: int), _col1 (type: string), _col2 (type: double), _wcol0 (type: double)
                    outputColumnNames: _col0, _col1, _col2, _col3
                    Statistics: Num rows: 4565 Data size: 36520 Basic stats: COMPLETE Column stats: COMPLETE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 4565 Data size: 36520 Basic stats: COMPLETE Column stats: COMPLETE
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

HIVE-8517.1.patch
19/Oct/14 18:27
0.7 kB
Mostafa Mokhtar
HIVE-8517.2.patch
20/Oct/14 23:48
0.8 kB
Mostafa Mokhtar
HIVE-8517.3.patch
21/Oct/14 17:51
3 kB
Mostafa Mokhtar

Issue Links

relates to

HIVE-8196 Joining on partition columns with fetch column stats enabled results it very small CE which negatively affects query performance

Closed

When joining on partition column NDV gets overridden by StatsUtils.getColStatisticsFromExpression

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates