[HIVE-11394] Enhance EXPLAIN display for vectorization - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.3.0
Component/s: Hive
Labels:
None

Description

Add detail to the EXPLAIN output showing why a Map and Reduce work is not vectorized.

New syntax is: EXPLAIN VECTORIZATION [ONLY] [SUMMARY|OPERATOR|EXPRESSION|DETAIL]

The ONLY option suppresses most non-vectorization elements.

SUMMARY shows vectorization information for the PLAN (is vectorization enabled) and a summary of Map and Reduce work.

OPERATOR shows vectorization information for operators. E.g. Filter Vectorization. It includes all information of SUMMARY, too.

EXPRESSION shows vectorization information for expressions. E.g. predicateExpression. It includes all information of SUMMARY and OPERATOR, too.

DETAIL shows very vectorization information.
It includes all information of SUMMARY, OPERATOR, and EXPRESSION too.

The optional clause defaults are not ONLY and SUMMARY.

---------------------------------------------------------------------------------------------------

Here are some examples:

EXPLAIN VECTORIZATION example:

(Note the PLAN VECTORIZATION, Map Vectorization, Reduce Vectorization sections)

Since SUMMARY is the default, it is the output of EXPLAIN VECTORIZATION SUMMARY.

Under Reducer 3’s "Reduce Vectorization:" you’ll see
notVectorizedReason: Aggregation Function UDF avg parameter expression for GROUPBY operator: Data type struct<count:bigint,sum:double,input:int> of Column[VALUE._col2] not supported

For Reducer 2’s "Reduce Vectorization:" you’ll see "groupByVectorOutput:": "false" which says a node has a GROUP BY with an AVG or some other aggregator that outputs a non-PRIMITIVE type (e.g. STRUCT) and all downstream operators are row-mode. I.e. not vector output.

If "usesVectorUDFAdaptor:": "false" were true, it would say there was at least one vectorized expression is using VectorUDFAdaptor.

And, "allNative:": "false" will be true when all operators are native. Today, GROUP BY and FILE SINK are not native. MAP JOIN and REDUCE SINK are conditionally native. FILTER and SELECT are native.

PLAN VECTORIZATION:
  enabled: true
  enabledConditionsMet: [hive.vectorized.execution.enabled IS true]

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
...
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
...
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: alltypesorc
                  Statistics: Num rows: 12288 Data size: 36696 Basic stats: COMPLETE Column stats: COMPLETE
                  Select Operator
                    expressions: cint (type: int)
                    outputColumnNames: cint
                    Statistics: Num rows: 12288 Data size: 36696 Basic stats: COMPLETE Column stats: COMPLETE
                    Group By Operator
                      keys: cint (type: int)
                      mode: hash
                      outputColumnNames: _col0
                      Statistics: Num rows: 5775 Data size: 17248 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 5775 Data size: 17248 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized, llap
            LLAP IO: all inputs
            Map Vectorization:
                enabled: true
                enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
                groupByVectorOutput: true
                inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                allNative: false
                usesVectorUDFAdaptor: false
                vectorized: true
        Reducer 2 
            Execution mode: vectorized, llap
            Reduce Vectorization:
                enabled: true
                enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true
                groupByVectorOutput: false
                allNative: false
                usesVectorUDFAdaptor: false
                vectorized: true
            Reduce Operator Tree:
              Group By Operator
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 5775 Data size: 17248 Basic stats: COMPLETE Column stats: COMPLETE
                Group By Operator
                  aggregations: sum(_col0), count(_col0), avg(_col0), std(_col0)
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2, _col3
                  Statistics: Num rows: 1 Data size: 172 Basic stats: COMPLETE Column stats: COMPLETE
                  Reduce Output Operator
                    sort order: 
                    Statistics: Num rows: 1 Data size: 172 Basic stats: COMPLETE Column stats: COMPLETE
                    value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: struct<count:bigint,sum:double,input:int>), _col3 (type: struct<count:bigint,sum:double,variance:double>)
        Reducer 3 
            Execution mode: llap
            Reduce Vectorization:
                enabled: true
                enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true
                notVectorizedReason: Aggregation Function UDF avg parameter expression for GROUPBY operator: Data type struct<count:bigint,sum:double,input:int> of Column[VALUE._col2] not supported
                vectorized: false
            Reduce Operator Tree:
              Group By Operator
                aggregations: sum(VALUE._col0), count(VALUE._col1), avg(VALUE._col2), std(VALUE._col3)
                mode: mergepartial
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

EXPLAIN VECTORIZATION OPERATOR

Notice the added TableScan Vectorization, Select Vectorization, Group By Vectorization, Map Join Vectorizatin, Reduce Sink Vectorization sections in this example.

Notice the nativeConditionsMet detail on why Reduce Vectorization is native.

PLAN VECTORIZATION:
  enabled: true
  enabledConditionsMet: [hive.vectorized.execution.enabled IS true]

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
#### A masked pattern was here ####
      Edges:
        Map 2 <- Map 1 (BROADCAST_EDGE)
        Reducer 3 <- Map 2 (SIMPLE_EDGE)
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: a
                  Statistics: Num rows: 3 Data size: 294 Basic stats: COMPLETE Column stats: NONE
                  TableScan Vectorization:
                      native: true
                      projectedOutputColumns: [0, 1]
                  Filter Operator
                    Filter Vectorization:
                        className: VectorFilterOperator
                        native: true
predicate: c2 is not null (type: boolean)
                    Statistics: Num rows: 3 Data size: 294 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: c1 (type: int), c2 (type: char(10))
                      outputColumnNames: _col0, _col1
                      Select Vectorization:
                          className: VectorSelectOperator
                          native: true
                          projectedOutputColumns: [0, 1]
                      Statistics: Num rows: 3 Data size: 294 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col1 (type: char(20))
                        sort order: +
                        Map-reduce partition columns: _col1 (type: char(20))
                        Reduce Sink Vectorization:
                            className: VectorReduceSinkStringOperator
                            native: true
                            nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true, Not ACID UPDATE or DELETE IS true, No buckets IS true, No TopN IS true, Uniform Hash IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
                        Statistics: Num rows: 3 Data size: 294 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col0 (type: int)
            Execution mode: vectorized, llap
            LLAP IO: all inputs
            Map Vectorization:
                enabled: true
                enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
                groupByVectorOutput: true
                inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                allNative: true
                usesVectorUDFAdaptor: false
                vectorized: true
        Map 2 
            Map Operator Tree:
                TableScan
                  alias: b
                  Statistics: Num rows: 3 Data size: 324 Basic stats: COMPLETE Column stats: NONE
                  TableScan Vectorization:
                      native: true
                      projectedOutputColumns: [0, 1]
                  Filter Operator
                    Filter Vectorization:
                        className: VectorFilterOperator
                        native: true
predicate: c2 is not null (type: boolean)
                    Statistics: Num rows: 3 Data size: 324 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: c1 (type: int), c2 (type: char(20))
                      outputColumnNames: _col0, _col1
                      Select Vectorization:
                          className: VectorSelectOperator
                          native: true
                          projectedOutputColumns: [0, 1]
                      Statistics: Num rows: 3 Data size: 324 Basic stats: COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col1 (type: char(20))
                          1 _col1 (type: char(20))
                        Map Join Vectorization:
                            className: VectorMapJoinInnerStringOperator
                            native: true
                            nativeConditionsMet: hive.vectorized.execution.mapjoin.native.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true, One MapJoin Condition IS true, No nullsafe IS true, Supports Key Types IS true, Not empty key IS true, When Fast Hash Table, then requires no Hybrid Hash Join IS true, Small table vectorizes IS true
                        outputColumnNames: _col0, _col1, _col2, _col3
                        input vertices:
                          0 Map 1
                        Statistics: Num rows: 3 Data size: 323 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: _col0 (type: int)
                          sort order: +
                          Reduce Sink Vectorization:
                              className: VectorReduceSinkOperator
                              native: false
                              nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true, Not ACID UPDATE or DELETE IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
                              nativeConditionsNotMet: Uniform Hash IS false
                          Statistics: Num rows: 3 Data size: 323 Basic stats: COMPLETE Column stats: NONE
                          value expressions: _col1 (type: char(10)), _col2 (type: int), _col3 (type: char(20))
            Execution mode: vectorized, llap
            LLAP IO: all inputs
            Map Vectorization:
                enabled: true
                enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
                groupByVectorOutput: true
                inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                allNative: false
                usesVectorUDFAdaptor: false
                vectorized: true
        Reducer 3 
            Execution mode: vectorized, llap
            Reduce Vectorization:
                enabled: true
                enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true
                groupByVectorOutput: true
                allNative: false
                usesVectorUDFAdaptor: false
                vectorized: true
            Reduce Operator Tree:
              Select Operator
                expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: char(10)), VALUE._col1 (type: int), VALUE._col2 (type: char(20))
                outputColumnNames: _col0, _col1, _col2, _col3
                Select Vectorization:
                    className: VectorSelectOperator
                    native: true
                    projectedOutputColumns: [0, 1, 2, 3]
                Statistics: Num rows: 3 Data size: 323 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  File Sink Vectorization:
                      className: VectorFileSinkOperator
                      native: false
                  Statistics: Num rows: 3 Data size: 323 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

EXPLAIN VECTORIZATION EXPRESSION

Notice the predicateExpression in this example.

PLAN VECTORIZATION:
  enabled: true
  enabledConditionsMet: [hive.vectorized.execution.enabled IS true]

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
#### A masked pattern was here ####
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: vector_interval_2
                  Statistics: Num rows: 2 Data size: 788 Basic stats: COMPLETE Column stats: NONE
                  TableScan Vectorization:
                      native: true
                      projectedOutputColumns: [0, 1, 2, 3, 4, 5]
                  Filter Operator
                    Filter Vectorization:
                        className: VectorFilterOperator
                        native: true
                        predicateExpression: FilterExprAndExpr(children: FilterTimestampScalarEqualTimestampColumn(val 2001-01-01 01:02:03.0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampScalarNotEqualTimestampColumn(val 2001-01-01 01:02:03.0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampScalarLessEqualTimestampColumn(val 2001-01-01 01:02:03.0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampScalarLessTimestampColumn(val 2001-01-01 01:02:03.0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampScalarGreaterEqualTimestampColumn(val 2001-01-01 01:02:03.0, col 6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampScalarGreaterTimestampColumn(val 2001-01-01 01:02:03.0, col 6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampColEqualTimestampScalar(col 6, val 2001-01-01 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampColNotEqualTimestampScalar(col 6, val 2001-01-01 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampColGreaterEqualTimestampScalar(col 6, val 2001-01-01 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampColGreaterTimestampScalar(col 6, val 2001-01-01 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampColLessEqualTimestampScalar(col 6, val 2001-01-01 01:02:03.0)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampColLessTimestampScalar(col 6, val 2001-01-01 01:02:03.0)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampColEqualTimestampColumn(col 0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampColNotEqualTimestampColumn(col 0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampColLessEqualTimestampColumn(col 0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampColLessTimestampColumn(col 0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampColGreaterEqualTimestampColumn(col 0, col 6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampColGreaterTimestampColumn(col 0, col 6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean) -> boolean
                    predicate: ((2001-01-01 01:02:03.0 = (dt + 0 01:02:03.000000000)) and (2001-01-01 01:02:03.0 <> (dt + 0 01:02:04.000000000)) and (2001-01-01 01:02:03.0 <= (dt + 0 01:02:03.000000000)) and (2001-01-01 01:02:03.0 < (dt + 0 01:02:04.000000000)) and (2001-01-01 01:02:03.0 >= (dt - 0 01:02:03.000000000)) and (2001-01-01 01:02:03.0 > (dt - 0 01:02:04.000000000)) and ((dt + 0 01:02:03.000000000) = 2001-01-01 01:02:03.0) and ((dt + 0 01:02:04.000000000) <> 2001-01-01 01:02:03.0) and ((dt + 0 01:02:03.000000000) >= 2001-01-01 01:02:03.0) and ((dt + 0 01:02:04.000000000) > 2001-01-01 01:02:03.0) and ((dt - 0 01:02:03.000000000) <= 2001-01-01 01:02:03.0) and ((dt - 0 01:02:04.000000000) < 2001-01-01 01:02:03.0) and (ts = (dt + 0 01:02:03.000000000)) and (ts <> (dt + 0 01:02:04.000000000)) and (ts <= (dt + 0 01:02:03.000000000)) and (ts < (dt + 0 01:02:04.000000000)) and (ts >= (dt - 0 01:02:03.000000000)) and (ts > (dt - 0 01:02:04.000000000))) (type: boolean)
                    Statistics: Num rows: 1 Data size: 394 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ts (type: timestamp)
                      outputColumnNames: _col0
                      Select Vectorization:
                          className: VectorSelectOperator
                          native: true
                          projectedOutputColumns: [0]
                      Statistics: Num rows: 1 Data size: 394 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: timestamp)
                        sort order: +
                        Reduce Sink Vectorization:
                            className: VectorReduceSinkOperator
                            native: false
                            nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true, Not ACID UPDATE or DELETE IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
                            nativeConditionsNotMet: Uniform Hash IS false
                        Statistics: Num rows: 1 Data size: 394 Basic stats: COMPLETE Column stats: NONE
            Execution mode: vectorized, llap
            LLAP IO: all inputs
            Map Vectorization:
                enabled: true
                enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
                groupByVectorOutput: true
                inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                allNative: false
                usesVectorUDFAdaptor: false
                vectorized: true
        Reducer 2 
...