Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
0.14.0
-
None
Description
Overflow in row counts and data size for several TPC-DS queries.
Interestingly the operators which have overflow end up running with a small parallelism.
For instance Reducer 2 has an overflow but it only runs with parallelism of 2.
Reducer 2 Reduce Operator Tree: Group By Operator aggregations: sum(VALUE._col0) keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float) mode: mergepartial outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 9223372036854775807 Data size: 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col3 (type: string), _col3 (type: string) sort order: ++ Map-reduce partition columns: _col3 (type: string) Statistics: Num rows: 9223372036854775807 Data size: 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col5 (type: double) Execution mode: vectorized
VERTEX TOTAL_TASKS DURATION_SECONDS CPU_TIME_MILLIS INPUT_RECORDS OUTPUT_RECORDS Map 1 62 26.41 1,779,510 211,978,502 60,628,390 Map 5 1 4.28 6,950 138,098 138,098 Map 6 1 2.44 3,910 31 31 Reducer 2 2 22.69 61,320 60,628,390 69,182 Reducer 3 1 2.63 3,910 69,182 100 Reducer 4 1 1.01 1,180 100 100
Query
explain select i_item_desc ,i_category ,i_class ,i_current_price ,i_item_id ,sum(ws_ext_sales_price) as itemrevenue ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over (partition by i_class) as revenueratio from web_sales ,item ,date_dim where web_sales.ws_item_sk = item.i_item_sk and item.i_category in ('Jewelry', 'Sports', 'Books') and web_sales.ws_sold_date_sk = date_dim.d_date_sk and date_dim.d_date between '2001-01-12' and '2001-02-11' group by i_item_id ,i_item_desc ,i_category ,i_class ,i_current_price order by i_category ,i_class ,i_item_id ,i_item_desc ,revenueratio limit 100
Explain
STAGE PLANS: Stage: Stage-1 Tez Edges: Map 1 <- Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE) Reducer 2 <- Map 1 (SIMPLE_EDGE) Reducer 3 <- Reducer 2 (SIMPLE_EDGE) Reducer 4 <- Reducer 3 (SIMPLE_EDGE) DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1 Vertices: Map 1 Map Operator Tree: TableScan alias: web_sales filterExpr: ws_item_sk is not null (type: boolean) Statistics: Num rows: 21594638446 Data size: 2850189889652 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: ws_item_sk is not null (type: boolean) Statistics: Num rows: 21594638446 Data size: 172746300152 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: ws_item_sk (type: int), ws_ext_sales_price (type: float), ws_sold_date_sk (type: int) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 21594638446 Data size: 172746300152 Basic stats: COMPLETE Column stats: COMPLETE Map Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {_col0} {_col1} 1 keys: 0 _col2 (type: int) 1 _col0 (type: int) outputColumnNames: _col0, _col1 input vertices: 1 Map 6 Statistics: Num rows: 24145061366 Data size: 193160490928 Basic stats: COMPLETE Column stats: COMPLETE Map Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {_col1} 1 {_col1} {_col2} {_col3} {_col4} {_col5} keys: 0 _col0 (type: int) 1 _col0 (type: int) outputColumnNames: _col1, _col6, _col7, _col8, _col9, _col10 input vertices: 1 Map 5 Statistics: Num rows: 25381041158 Data size: 11929089344260 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col6 (type: string), _col7 (type: string), _col10 (type: string), _col9 (type: string), _col8 (type: float), _col1 (type: float) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 25381041158 Data size: 11929089344260 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator aggregations: sum(_col5) keys: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float) mode: hash outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float) sort order: +++++ Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float) Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col5 (type: double) Execution mode: vectorized Map 5 Map Operator Tree: TableScan alias: item filterExpr: ((i_category) IN ('Jewelry', 'Sports', 'Books') and i_item_sk is not null) (type: boolean) Statistics: Num rows: 462000 Data size: 663862160 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: ((i_category) IN ('Jewelry', 'Sports', 'Books') and i_item_sk is not null) (type: boolean) Statistics: Num rows: 231000 Data size: 109491664 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: i_item_sk (type: int), i_item_id (type: string), i_item_desc (type: string), i_current_price (type: float), i_class (type: string), i_category (type: string) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 231000 Data size: 109491664 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 231000 Data size: 109491664 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: float), _col4 (type: string), _col5 (type: string) Execution mode: vectorized Map 6 Map Operator Tree: TableScan alias: date_dim filterExpr: (d_date BETWEEN '2001-01-12' AND '2001-02-11' and d_date_sk is not null) (type: boolean) Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: (d_date BETWEEN '2001-01-12' AND '2001-02-11' and d_date_sk is not null) (type: boolean) Statistics: Num rows: 36524 Data size: 3579352 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: d_date_sk (type: int) outputColumnNames: _col0 Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col0 (type: int) outputColumnNames: _col0 Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE Group By Operator keys: _col0 (type: int) mode: hash outputColumnNames: _col0 Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE Dynamic Partitioning Event Operator Target Input: web_sales Partition key expr: ws_sold_date_sk Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE Target column: ws_sold_date_sk Target Vertex: Map 1 Execution mode: vectorized Reducer 2 Reduce Operator Tree: Group By Operator aggregations: sum(VALUE._col0) keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float) mode: mergepartial outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col3 (type: string), _col3 (type: string) sort order: ++ Map-reduce partition columns: _col3 (type: string) Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col5 (type: double) Execution mode: vectorized Reducer 3 Reduce Operator Tree: Extract Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE PTF Operator Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col0 (type: string), _col5 (type: double), ((_col5 * 100.0) / _wcol0) (type: double) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6 Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col1 (type: string), _col2 (type: string), _col4 (type: string), _col0 (type: string), _col6 (type: double) sort order: +++++ Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE TopN Hash Memory Usage: 0.04 value expressions: _col3 (type: float), _col5 (type: double) Reducer 4 Reduce Operator Tree: Select Operator expressions: KEY.reducesinkkey3 (type: string), KEY.reducesinkkey0 (type: string), KEY.reducesinkkey1 (type: string), VALUE._col0 (type: float), KEY.reducesinkkey2 (type: string), VALUE._col1 (type: double), KEY.reducesinkkey4 (type: double) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6 Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE Limit Number of rows: 100 Statistics: Num rows: 100 Data size: 800 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 100 Data size: 800 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Execution mode: vectorized Stage: Stage-0 Fetch Operator limit: 100 Processor Tree: ListSink
Attachments
Attachments
Issue Links
- relates to
-
HIVE-8689 handle overflows in statistics better
- Closed
- links to