Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
0.14.0, 1.0.0, 1.1.0, 1.2.0
-
None
Description
TPC-DS Q82 is running slower than hive 13 because the join type is not correct.
The estimate for item x inventory x date_dim is 227 Million rows while the actual is 3K rows.
Hive 13 finishes in 753 seconds.
Hive 14 finishes in 1,267 seconds.
Hive 14 + force map join finished in 431 seconds.
Query
select i_item_id ,i_item_desc ,i_current_price from item, inventory, date_dim, store_sales where i_current_price between 30 and 30+30 and inv_item_sk = i_item_sk and d_date_sk=inv_date_sk and d_date between '2002-05-30' and '2002-07-30' and i_manufact_id in (437,129,727,663) and inv_quantity_on_hand between 100 and 500 and ss_item_sk = i_item_sk group by i_item_id,i_item_desc,i_current_price order by i_item_id limit 100
Plan
STAGE PLANS: Stage: Stage-1 Tez Edges: Map 7 <- Map 1 (BROADCAST_EDGE), Map 2 (BROADCAST_EDGE) Reducer 4 <- Map 3 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE) Reducer 5 <- Reducer 4 (SIMPLE_EDGE) Reducer 6 <- Reducer 5 (SIMPLE_EDGE) DagName: mmokhtar_20141106005353_7a2eb8df-12ff-4fe9-89b4-30f1e4e3fb90:1 Vertices: Map 1 Map Operator Tree: TableScan alias: item filterExpr: ((i_current_price BETWEEN 30 AND 60 and (i_manufact_id) IN (437, 129, 727, 663)) and i_item_sk is not null) (type: boolean) Statistics: Num rows: 462000 Data size: 663862160 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: ((i_current_price BETWEEN 30 AND 60 and (i_manufact_id) IN (437, 129, 727, 663)) and i_item_sk is not null) (type: boolean) Statistics: Num rows: 115500 Data size: 34185680 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: i_item_sk (type: int), i_item_id (type: string), i_item_desc (type: string), i_current_price (type: float) outputColumnNames: _col0, _col1, _col2, _col3 Statistics: Num rows: 115500 Data size: 33724832 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 115500 Data size: 33724832 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: float) Execution mode: vectorized Map 2 Map Operator Tree: TableScan alias: date_dim filterExpr: (d_date BETWEEN '2002-05-30' AND '2002-07-30' and d_date_sk is not null) (type: boolean) Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: (d_date BETWEEN '2002-05-30' AND '2002-07-30' and d_date_sk is not null) (type: boolean) Statistics: Num rows: 36524 Data size: 3579352 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: d_date_sk (type: int) outputColumnNames: _col0 Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col0 (type: int) outputColumnNames: _col0 Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator keys: _col0 (type: int) mode: hash outputColumnNames: _col0 Statistics: Num rows: 18262 Data size: 73048 Basic stats: COMPLETE Column stats: COMPLETE Dynamic Partitioning Event Operator Target Input: inventory Partition key expr: inv_date_sk Statistics: Num rows: 18262 Data size: 73048 Basic stats: COMPLETE Column stats: COMPLETE Target column: inv_date_sk Target Vertex: Map 7 Execution mode: vectorized Map 3 Map Operator Tree: TableScan alias: store_sales filterExpr: ss_item_sk is not null (type: boolean) Statistics: Num rows: 82510879939 Data size: 6873789738208 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: ss_item_sk is not null (type: boolean) Statistics: Num rows: 82510879939 Data size: 330043519756 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: ss_item_sk (type: int) outputColumnNames: _col0 Statistics: Num rows: 82510879939 Data size: 330043519756 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 82510879939 Data size: 330043519756 Basic stats: COMPLETE Column stats: COMPLETE Execution mode: vectorized Map 7 Map Operator Tree: TableScan alias: inventory filterExpr: (inv_quantity_on_hand BETWEEN 100 AND 500 and inv_item_sk is not null) (type: boolean) Statistics: Num rows: 1,627,857,000 Data size: 19208695084 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: (inv_quantity_on_hand BETWEEN 100 AND 500 and inv_item_sk is not null) (type: boolean) Statistics: Num rows: 813,928,500 Data size: 9604347540 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: inv_item_sk (type: int), inv_date_sk (type: int) outputColumnNames: _col0, _col2 Statistics: Num rows: 813,928,500 Data size: 6511428000 Basic stats: COMPLETE Column stats: COMPLETE Map Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {_col2} 1 {_col0} {_col1} {_col2} {_col3} keys: 0 _col0 (type: int) 1 _col0 (type: int) outputColumnNames: _col2, _col3, _col4, _col5, _col6 input vertices: 1 Map 1 Statistics: Num rows: 203,482,128 Data size: 59416781376 Basic stats: COMPLETE Column stats: COMPLETE Map Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {_col3} {_col4} {_col5} {_col6} 1 keys: 0 _col2 (type: int) 1 _col0 (type: int) outputColumnNames: _col3, _col4, _col5, _col6 input vertices: 1 Map 2 Statistics: Num rows: 227,514,273 Data size: 66434167716 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col3 (type: int), _col4 (type: string), _col5 (type: string), _col6 (type: float) outputColumnNames: _col3, _col4, _col5, _col6 Statistics: Num rows: 227514273 Data size: 66434167716 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col3 (type: int) sort order: + Map-reduce partition columns: _col3 (type: int) Statistics: Num rows: 227514273 Data size: 66434167716 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col4 (type: string), _col5 (type: string), _col6 (type: float) Execution mode: vectorized Reducer 4 Reduce Operator Tree: Merge Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 1 {VALUE._col3} {VALUE._col4} {VALUE._col5} outputColumnNames: _col5, _col6, _col7 Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col5 (type: string), _col6 (type: string), _col7 (type: float) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator keys: _col0 (type: string), _col1 (type: string), _col2 (type: float) mode: hash outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: float) sort order: +++ Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: float) Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE Reducer 5 Reduce Operator Tree: Group By Operator keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: float) mode: mergepartial outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col0 (type: string), _col1 (type: string), _col2 (type: float) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE TopN Hash Memory Usage: 0.04 value expressions: _col1 (type: string), _col2 (type: float) Execution mode: vectorized Reducer 6 Reduce Operator Tree: Select Operator expressions: KEY.reducesinkkey0 (type: string), VALUE._col0 (type: string), VALUE._col1 (type: float) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE Limit Number of rows: 100 Statistics: Num rows: 100 Data size: 28800 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 100 Data size: 28800 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Execution mode: vectorized Stage: Stage-0 Fetch Operator limit: 100 Processor Tree: ListSink
Actual rows counts
VERTICES TOTAL_TASKS FAILED_ATTEMPTS KILLED_TASKS DURATION_SECONDS CPU_TIME_MILLIS GC_TIME_MILLIS INPUT_RECORDS OUTPUT_RECORDS Map 1 1 0 0 1.01 1,280 0 462,000 65 Map 2 1 0 0 0.41 400 23 10,000 62 Map 3 2574 0 0 947.79 442,220,640 1,887,714 82,510,879,939 82,510,879,939 Map 7 9 0 0 869.22 42,490 1,215 56,133,127 3,081 Reducer 4 1009 0 0 389.75 69,471,510 1,149,529 82,510,883,020 33 Reducer 5 253 0 0 93.73 938,930 26,150 33 33 Reducer 6 1 0 0 2.08 730 10 33 33
Attachments
Attachments
Issue Links
- relates to
-
HIVE-9392 JoinStatsRule miscalculates join cardinality as incorrect NDV is used due to column names having duplicated fqColumnName
- Resolved
-
HIVE-10812 Scaling PK/FK's selectivity for stats annotation
- Resolved
- links to