Details
-
Sub-task
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
Description
Merge statements may contain delete and update branches. Update is technically a delete and an insert operation. Column statistics can not be calculated in case of delete operations from the changed records. Example: min, max.
Currently Hive marks the column stats of the target table invalid after Update/Delete/Merge but for merge extra GBY operators and reducers are generated for insert branches to calculate column stats and Stats works are collecting Column stats too.
POSTHOOK: query: explain merge into acidTbl_n0 as t using nonAcidOrcTbl_n0 s ON t.a = s.a WHEN MATCHED AND s.a > 8 THEN DELETE WHEN MATCHED THEN UPDATE SET b = 7 WHEN NOT MATCHED THEN INSERT VALUES(s.a, s.b) POSTHOOK: type: QUERY POSTHOOK: Input: default@acidtbl_n0 POSTHOOK: Input: default@nonacidorctbl_n0 POSTHOOK: Output: default@acidtbl_n0 POSTHOOK: Output: default@acidtbl_n0 POSTHOOK: Output: default@merge_tmp_table STAGE DEPENDENCIES: Stage-5 is a root stage Stage-6 depends on stages: Stage-5 Stage-0 depends on stages: Stage-6 Stage-7 depends on stages: Stage-0 Stage-1 depends on stages: Stage-6 Stage-8 depends on stages: Stage-1 Stage-2 depends on stages: Stage-6 Stage-9 depends on stages: Stage-2 Stage-3 depends on stages: Stage-6 Stage-10 depends on stages: Stage-3 Stage-4 depends on stages: Stage-6 Stage-11 depends on stages: Stage-4 STAGE PLANS: Stage: Stage-5 Tez #### A masked pattern was here #### Edges: Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 10 (SIMPLE_EDGE) Reducer 3 <- Reducer 2 (SIMPLE_EDGE) Reducer 4 <- Reducer 2 (SIMPLE_EDGE) Reducer 5 <- Reducer 2 (SIMPLE_EDGE) Reducer 6 <- Reducer 5 (CUSTOM_SIMPLE_EDGE) Reducer 7 <- Reducer 2 (SIMPLE_EDGE) Reducer 8 <- Reducer 7 (CUSTOM_SIMPLE_EDGE) Reducer 9 <- Reducer 2 (SIMPLE_EDGE) #### A masked pattern was here #### Vertices: Map 1 Map Operator Tree: TableScan alias: s Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: a (type: int), b (type: int) outputColumnNames: _col0, _col1 Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) null sort order: z sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col1 (type: int) Execution mode: vectorized, llap LLAP IO: all inputs Map 10 Map Operator Tree: TableScan alias: t filterExpr: a is not null (type: boolean) Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: a is not null (type: boolean) Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: a (type: int), ROW__ID (type: struct<writeid:bigint,bucketid:int,rowid:bigint>) outputColumnNames: _col0, _col1 Statistics: Num rows: 2 Data size: 160 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) null sort order: z sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 2 Data size: 160 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col1 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>) Execution mode: vectorized, llap LLAP IO: may be used (ACID table) Reducer 2 Execution mode: llap Reduce Operator Tree: Merge Join Operator condition map: Left Outer Join 0 to 1 keys: 0 _col0 (type: int) 1 _col0 (type: int) outputColumnNames: _col0, _col1, _col2, _col3 Statistics: Num rows: 6 Data size: 288 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col3 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>), _col1 (type: int), _col2 (type: int), _col0 (type: int) outputColumnNames: _col0, _col1, _col2, _col3 Statistics: Num rows: 6 Data size: 288 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: ((_col2 = _col3) and (_col3 > 8)) (type: boolean) Statistics: Num rows: 1 Data size: 88 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>) outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>) null sort order: z sort order: + Map-reduce partition columns: UDFToInteger(_col0) (type: int) Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: ((_col2 = _col3) and (_col3 <= 8)) (type: boolean) Statistics: Num rows: 2 Data size: 176 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>) outputColumnNames: _col0 Statistics: Num rows: 2 Data size: 152 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>) null sort order: z sort order: + Map-reduce partition columns: UDFToInteger(_col0) (type: int) Statistics: Num rows: 2 Data size: 152 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: ((_col2 = _col3) and (_col3 <= 8)) (type: boolean) Statistics: Num rows: 2 Data size: 176 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col2 (type: int), 7 (type: int) outputColumnNames: _col0, _col1 Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) null sort order: a sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col1 (type: int) Filter Operator predicate: _col2 is null (type: boolean) Statistics: Num rows: 4 Data size: 192 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col3 (type: int), _col1 (type: int) outputColumnNames: _col0, _col1 Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) null sort order: a sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col1 (type: int) Filter Operator predicate: (_col2 = _col3) (type: boolean) Statistics: Num rows: 3 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>) outputColumnNames: _col0 Statistics: Num rows: 3 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator aggregations: count() keys: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>) minReductionHashAggr: 0.4 mode: hash outputColumnNames: _col0, _col1 Statistics: Num rows: 2 Data size: 168 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>) null sort order: z sort order: + Map-reduce partition columns: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>) Statistics: Num rows: 2 Data size: 168 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col1 (type: bigint) Reducer 3 Execution mode: vectorized, llap Reduce Operator Tree: Select Operator expressions: KEY.reducesinkkey0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>) outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde name: default.acidtbl_n0 Write Type: DELETE Reducer 4 Execution mode: vectorized, llap Reduce Operator Tree: Select Operator expressions: KEY.reducesinkkey0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>) outputColumnNames: _col0 Statistics: Num rows: 2 Data size: 152 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 2 Data size: 152 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde name: default.acidtbl_n0 Write Type: DELETE Reducer 5 Execution mode: vectorized, llap Reduce Operator Tree: Select Operator expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: int) outputColumnNames: _col0, _col1 Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde name: default.acidtbl_n0 Write Type: INSERT Select Operator expressions: _col0 (type: int), _col1 (type: int) outputColumnNames: a, b Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator aggregations: min(a), max(a), count(1), count(a), compute_bit_vector_hll(a), min(b), max(b), count(b), compute_bit_vector_hll(b) minReductionHashAggr: 0.5 mode: hash outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8 Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator null sort order: sort order: Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col0 (type: int), _col1 (type: int), _col2 (type: bigint), _col3 (type: bigint), _col4 (type: binary), _col5 (type: int), _col6 (type: int), _col7 (type: bigint), _col8 (type: binary) Reducer 6 Execution mode: vectorized, llap Reduce Operator Tree: Group By Operator aggregations: min(VALUE._col0), max(VALUE._col1), count(VALUE._col2), count(VALUE._col3), compute_bit_vector_hll(VALUE._col4), min(VALUE._col5), max(VALUE._col6), count(VALUE._col7), compute_bit_vector_hll(VALUE._col8) mode: mergepartial outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8 Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: 'LONG' (type: string), UDFToLong(_col0) (type: bigint), UDFToLong(_col1) (type: bigint), (_col2 - _col3) (type: bigint), COALESCE(ndv_compute_bit_vector(_col4),0) (type: bigint), _col4 (type: binary), 'LONG' (type: string), UDFToLong(_col5) (type: bigint), UDFToLong(_col6) (type: bigint), (_col2 - _col7) (type: bigint), COALESCE(ndv_compute_bit_vector(_col8),0) (type: bigint), _col8 (type: binary) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11 Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Reducer 7 Execution mode: vectorized, llap Reduce Operator Tree: Select Operator expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: int) outputColumnNames: _col0, _col1 Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde name: default.acidtbl_n0 Write Type: INSERT Select Operator expressions: _col0 (type: int), _col1 (type: int) outputColumnNames: a, b Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator aggregations: min(a), max(a), count(1), count(a), compute_bit_vector_hll(a), min(b), max(b), count(b), compute_bit_vector_hll(b) minReductionHashAggr: 0.75 mode: hash outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8 Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator null sort order: sort order: Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col0 (type: int), _col1 (type: int), _col2 (type: bigint), _col3 (type: bigint), _col4 (type: binary), _col5 (type: int), _col6 (type: int), _col7 (type: bigint), _col8 (type: binary) Reducer 8 Execution mode: vectorized, llap Reduce Operator Tree: Group By Operator aggregations: min(VALUE._col0), max(VALUE._col1), count(VALUE._col2), count(VALUE._col3), compute_bit_vector_hll(VALUE._col4), min(VALUE._col5), max(VALUE._col6), count(VALUE._col7), compute_bit_vector_hll(VALUE._col8) mode: mergepartial outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8 Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: 'LONG' (type: string), UDFToLong(_col0) (type: bigint), UDFToLong(_col1) (type: bigint), (_col2 - _col3) (type: bigint), COALESCE(ndv_compute_bit_vector(_col4),0) (type: bigint), _col4 (type: binary), 'LONG' (type: string), UDFToLong(_col5) (type: bigint), UDFToLong(_col6) (type: bigint), (_col2 - _col7) (type: bigint), COALESCE(ndv_compute_bit_vector(_col8),0) (type: bigint), _col8 (type: binary) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11 Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Reducer 9 Execution mode: llap Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) keys: KEY._col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>) mode: mergepartial outputColumnNames: _col0, _col1 Statistics: Num rows: 2 Data size: 168 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: (_col1 > 1L) (type: boolean) Statistics: Num rows: 1 Data size: 84 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: cardinality_violation(_col0) (type: int) outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: default.merge_tmp_table Stage: Stage-6 Dependency Collection Stage: Stage-0 Move Operator tables: replace: false table: input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde name: default.acidtbl_n0 Write Type: DELETE Stage: Stage-7 Stats Work Basic Stats Work: Stage: Stage-1 Move Operator tables: replace: false table: input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde name: default.acidtbl_n0 Write Type: DELETE Stage: Stage-8 Stats Work Basic Stats Work: Stage: Stage-2 Move Operator tables: replace: false table: input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde name: default.acidtbl_n0 Write Type: INSERT Stage: Stage-9 Stats Work Basic Stats Work: Stage: Stage-3 Move Operator tables: replace: false table: input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde name: default.acidtbl_n0 Write Type: INSERT Stage: Stage-10 Stats Work Basic Stats Work: Column Stats Desc: Columns: a, b Column Types: int, int Table: default.acidtbl_n0 Stage: Stage-4 Move Operator tables: replace: false table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: default.merge_tmp_table Stage: Stage-11 Stats Work Basic Stats Work:
One of the insert Reducers and the follow-up Reducer for col stats collecting:
Reducer 5 Execution mode: vectorized, llap Reduce Operator Tree: Select Operator expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: int) outputColumnNames: _col0, _col1 Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde name: default.acidtbl_n0 Write Type: INSERT Select Operator expressions: _col0 (type: int), _col1 (type: int) outputColumnNames: a, b Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator aggregations: min(a), max(a), count(1), count(a), compute_bit_vector_hll(a), min(b), max(b), count(b), compute_bit_vector_hll(b) minReductionHashAggr: 0.5 mode: hash outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8 Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator null sort order: sort order: Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col0 (type: int), _col1 (type: int), _col2 (type: bigint), _col3 (type: bigint), _col4 (type: binary), _col5 (type: int), _col6 (type: int), _col7 (type: bigint), _col8 (type: binary) Reducer 6 Execution mode: vectorized, llap Reduce Operator Tree: Group By Operator aggregations: min(VALUE._col0), max(VALUE._col1), count(VALUE._col2), count(VALUE._col3), compute_bit_vector_hll(VALUE._col4), min(VALUE._col5), max(VALUE._col6), count(VALUE._col7), compute_bit_vector_hll(VALUE._col8) mode: mergepartial outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8 Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: 'LONG' (type: string), UDFToLong(_col0) (type: bigint), UDFToLong(_col1) (type: bigint), (_col2 - _col3) (type: bigint), COALESCE(ndv_compute_bit_vector(_col4),0) (type: bigint), _col4 (type: binary), 'LONG' (type: string), UDFToLong(_col5) (type: bigint), UDFToLong(_col6) (type: bigint), (_col2 - _col7) (type: bigint), COALESCE(ndv_compute_bit_vector(_col8),0) (type: bigint), _col8 (type: binary) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11 Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Attachments
Issue Links
- links to