  1. Hive
  2. HIVE-11394

Enhance EXPLAIN display for vectorization

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3.0
    • Component/s: Hive
    • Labels: None

    Description

      Add detail to the EXPLAIN output showing why Map or Reduce work is not vectorized.

      New syntax is: EXPLAIN VECTORIZATION [ONLY] [SUMMARY|OPERATOR|EXPRESSION|DETAIL]

      The ONLY option suppresses most non-vectorization elements.

      SUMMARY shows vectorization information for the PLAN (whether vectorization is enabled) and a summary of Map and Reduce work.

      OPERATOR shows vectorization information for operators, e.g. Filter Vectorization. It also includes all SUMMARY information.

      EXPRESSION shows vectorization information for expressions, e.g. predicateExpression. It also includes all SUMMARY and OPERATOR information.

      DETAIL shows detail-level vectorization information. It also includes all SUMMARY, OPERATOR, and EXPRESSION information.

      The defaults for the optional clauses are: ONLY is not applied, and the level is SUMMARY.
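The grammar and defaults above can be sketched as a small statement builder. This is an illustrative Python sketch only: `build_explain` and `included_levels` are hypothetical helper names, not part of Hive, and the query text is a placeholder.

```python
# Hypothetical helpers, not part of Hive: they only assemble the statement
# text following the grammar quoted in the description.
LEVELS = ("SUMMARY", "OPERATOR", "EXPRESSION", "DETAIL")


def build_explain(query, only=False, level=None):
    """Return 'EXPLAIN VECTORIZATION [ONLY] [level] <query>'."""
    parts = ["EXPLAIN", "VECTORIZATION"]
    if only:
        parts.append("ONLY")
    if level is not None:
        level = level.upper()
        if level not in LEVELS:
            raise ValueError("unknown level: " + level)
        parts.append(level)
    parts.append(query.strip())
    return " ".join(parts)


def included_levels(level="SUMMARY"):
    """Each level also includes the information of every level before it."""
    return LEVELS[: LEVELS.index(level.upper()) + 1]


if __name__ == "__main__":
    q = "SELECT COUNT(*) FROM alltypesorc"
    print(build_explain(q))                            # defaults: no ONLY, SUMMARY implied
    print(build_explain(q, only=True, level="detail"))
    print(included_levels("EXPRESSION"))
```

Omitting both optional clauses leaves the statement as plain EXPLAIN VECTORIZATION, which per the description behaves like the SUMMARY level without ONLY.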

      ---------------------------------------------------------------------------------------------------

      Here are some examples:

      EXPLAIN VECTORIZATION example:

      (Note the PLAN VECTORIZATION, Map Vectorization, Reduce Vectorization sections)

      Since SUMMARY is the default, this is the same output as EXPLAIN VECTORIZATION SUMMARY.

      Under Reducer 3’s "Reduce Vectorization:" you’ll see
      notVectorizedReason: Aggregation Function UDF avg parameter expression for GROUPBY operator: Data type struct<count:bigint,sum:double,input:int> of Column[VALUE._col2] not supported

      Under Reducer 2’s "Reduce Vectorization:" you’ll see "groupByVectorOutput: false", which says the node has a GROUP BY with an AVG or some other aggregator that outputs a non-PRIMITIVE type (e.g. STRUCT) and that all downstream operators are row-mode, i.e. not vector output.

      "usesVectorUDFAdaptor" is false here. If it were true, it would mean that at least one vectorized expression uses VectorUDFAdaptor.

      "allNative" is also false here. It will be true only when all operators are native. Today, GROUP BY and FILE SINK are not native, MAP JOIN and REDUCE SINK are conditionally native, and FILTER and SELECT are native.

      PLAN VECTORIZATION:
        enabled: true
        enabledConditionsMet: [hive.vectorized.execution.enabled IS true]
      
      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 depends on stages: Stage-1
      
      STAGE PLANS:
        Stage: Stage-1
          Tez
      ...
            Edges:
              Reducer 2 <- Map 1 (SIMPLE_EDGE)
              Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
      ...
            Vertices:
              Map 1 
                  Map Operator Tree:
                      TableScan
                        alias: alltypesorc
                        Statistics: Num rows: 12288 Data size: 36696 Basic stats: COMPLETE Column stats: COMPLETE
                        Select Operator
                          expressions: cint (type: int)
                          outputColumnNames: cint
                          Statistics: Num rows: 12288 Data size: 36696 Basic stats: COMPLETE Column stats: COMPLETE
                          Group By Operator
                            keys: cint (type: int)
                            mode: hash
                            outputColumnNames: _col0
                            Statistics: Num rows: 5775 Data size: 17248 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col0 (type: int)
                              sort order: +
                              Map-reduce partition columns: _col0 (type: int)
                              Statistics: Num rows: 5775 Data size: 17248 Basic stats: COMPLETE Column stats: COMPLETE
                  Execution mode: vectorized, llap
                  LLAP IO: all inputs
                  Map Vectorization:
                      enabled: true
                      enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
                      groupByVectorOutput: true
                      inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                      allNative: false
                      usesVectorUDFAdaptor: false
                      vectorized: true
              Reducer 2 
                  Execution mode: vectorized, llap
                  Reduce Vectorization:
                      enabled: true
                      enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true
                      groupByVectorOutput: false
                      allNative: false
                      usesVectorUDFAdaptor: false
                      vectorized: true
                  Reduce Operator Tree:
                    Group By Operator
                      keys: KEY._col0 (type: int)
                      mode: mergepartial
                      outputColumnNames: _col0
                      Statistics: Num rows: 5775 Data size: 17248 Basic stats: COMPLETE Column stats: COMPLETE
                      Group By Operator
                        aggregations: sum(_col0), count(_col0), avg(_col0), std(_col0)
                        mode: hash
                        outputColumnNames: _col0, _col1, _col2, _col3
                        Statistics: Num rows: 1 Data size: 172 Basic stats: COMPLETE Column stats: COMPLETE
                        Reduce Output Operator
                          sort order: 
                          Statistics: Num rows: 1 Data size: 172 Basic stats: COMPLETE Column stats: COMPLETE
                          value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: struct<count:bigint,sum:double,input:int>), _col3 (type: struct<count:bigint,sum:double,variance:double>)
              Reducer 3 
                  Execution mode: llap
                  Reduce Vectorization:
                      enabled: true
                      enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true
                      notVectorizedReason: Aggregation Function UDF avg parameter expression for GROUPBY operator: Data type struct<count:bigint,sum:double,input:int> of Column[VALUE._col2] not supported
                      vectorized: false
                  Reduce Operator Tree:
                    Group By Operator
                      aggregations: sum(VALUE._col0), count(VALUE._col1), avg(VALUE._col2), std(VALUE._col3)
                      mode: mergepartial
                      outputColumnNames: _col0, _col1, _col2, _col3
                      Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
                        table:
                            input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
            Processor Tree:
              ListSink 
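
When a plan has many vertices, the Map Vectorization and Reduce Vectorization sections can be scanned mechanically. The following Python sketch assumes the text layout shown in the example above (vertex names such as "Reducer 3" on their own line, followed by "key: value" lines); `summarize` and `not_vectorized` are hypothetical names, and the sample is an excerpt of the output above.

```python
import re

# Vertex headers look like "Map 1" or "Reducer 3" on a line of their own.
VERTEX = re.compile(r"^\s*((?:Map|Reducer) \d+)\s*$")
# Fields of interest inside the Map/Reduce Vectorization sections.
FIELD = re.compile(
    r"^\s*(vectorized|allNative|usesVectorUDFAdaptor|groupByVectorOutput"
    r"|notVectorizedReason): (.*)$"
)


def summarize(explain_text):
    """Map each vertex name to the vectorization fields found under it."""
    result = {}
    current = None
    for line in explain_text.splitlines():
        m = VERTEX.match(line)
        if m:
            current = m.group(1)
            result[current] = {}
            continue
        m = FIELD.match(line)
        if m and current is not None:
            result[current][m.group(1)] = m.group(2).strip()
    return result


def not_vectorized(explain_text):
    """Return {vertex: reason} for every vertex reported as not vectorized."""
    return {
        vertex: fields.get("notVectorizedReason", "(no reason given)")
        for vertex, fields in summarize(explain_text).items()
        if fields.get("vectorized") == "false"
    }


# Excerpt of the EXPLAIN VECTORIZATION output shown above.
SAMPLE = """\
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Vectorization:
                enabled: true
                groupByVectorOutput: false
                allNative: false
                usesVectorUDFAdaptor: false
                vectorized: true
        Reducer 3
            Execution mode: llap
            Reduce Vectorization:
                enabled: true
                notVectorizedReason: Aggregation Function UDF avg parameter expression for GROUPBY operator: Data type struct<count:bigint,sum:double,input:int> of Column[VALUE._col2] not supported
                vectorized: false
"""

if __name__ == "__main__":
    for vertex, reason in not_vectorized(SAMPLE).items():
        print(vertex, "->", reason)
```

Matching the field name at the start of the line keeps "Execution mode: vectorized, llap" from being mistaken for the "vectorized:" field.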
      

      EXPLAIN VECTORIZATION OPERATOR

      Notice the added TableScan Vectorization, Filter Vectorization, Select Vectorization, Map Join Vectorization, Reduce Sink Vectorization, and File Sink Vectorization sections in this example.

      Notice the nativeConditionsMet and nativeConditionsNotMet details, which show why a Reduce Sink or Map Join operator is or is not native.

      PLAN VECTORIZATION:
        enabled: true
        enabledConditionsMet: [hive.vectorized.execution.enabled IS true]
      
      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 depends on stages: Stage-1
      
      STAGE PLANS:
        Stage: Stage-1
          Tez
      #### A masked pattern was here ####
            Edges:
              Map 2 <- Map 1 (BROADCAST_EDGE)
              Reducer 3 <- Map 2 (SIMPLE_EDGE)
      #### A masked pattern was here ####
            Vertices:
              Map 1 
                  Map Operator Tree:
                      TableScan
                        alias: a
                        Statistics: Num rows: 3 Data size: 294 Basic stats: COMPLETE Column stats: NONE
                        TableScan Vectorization:
                            native: true
                            projectedOutputColumns: [0, 1]
                        Filter Operator
                          Filter Vectorization:
                              className: VectorFilterOperator
                              native: true
                          predicate: c2 is not null (type: boolean)
                          Statistics: Num rows: 3 Data size: 294 Basic stats: COMPLETE Column stats: NONE
                          Select Operator
                            expressions: c1 (type: int), c2 (type: char(10))
                            outputColumnNames: _col0, _col1
                            Select Vectorization:
                                className: VectorSelectOperator
                                native: true
                                projectedOutputColumns: [0, 1]
                            Statistics: Num rows: 3 Data size: 294 Basic stats: COMPLETE Column stats: NONE
                            Reduce Output Operator
                              key expressions: _col1 (type: char(20))
                              sort order: +
                              Map-reduce partition columns: _col1 (type: char(20))
                              Reduce Sink Vectorization:
                                  className: VectorReduceSinkStringOperator
                                  native: true
                                  nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true, Not ACID UPDATE or DELETE IS true, No buckets IS true, No TopN IS true, Uniform Hash IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
                              Statistics: Num rows: 3 Data size: 294 Basic stats: COMPLETE Column stats: NONE
                              value expressions: _col0 (type: int)
                  Execution mode: vectorized, llap
                  LLAP IO: all inputs
                  Map Vectorization:
                      enabled: true
                      enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
                      groupByVectorOutput: true
                      inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                      allNative: true
                      usesVectorUDFAdaptor: false
                      vectorized: true
              Map 2 
                  Map Operator Tree:
                      TableScan
                        alias: b
                        Statistics: Num rows: 3 Data size: 324 Basic stats: COMPLETE Column stats: NONE
                        TableScan Vectorization:
                            native: true
                            projectedOutputColumns: [0, 1]
                        Filter Operator
                          Filter Vectorization:
                              className: VectorFilterOperator
                              native: true
                          predicate: c2 is not null (type: boolean)
                          Statistics: Num rows: 3 Data size: 324 Basic stats: COMPLETE Column stats: NONE
                          Select Operator
                            expressions: c1 (type: int), c2 (type: char(20))
                            outputColumnNames: _col0, _col1
                            Select Vectorization:
                                className: VectorSelectOperator
                                native: true
                                projectedOutputColumns: [0, 1]
                            Statistics: Num rows: 3 Data size: 324 Basic stats: COMPLETE Column stats: NONE
                            Map Join Operator
                              condition map:
                                   Inner Join 0 to 1
                              keys:
                                0 _col1 (type: char(20))
                                1 _col1 (type: char(20))
                              Map Join Vectorization:
                                  className: VectorMapJoinInnerStringOperator
                                  native: true
                                  nativeConditionsMet: hive.vectorized.execution.mapjoin.native.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true, One MapJoin Condition IS true, No nullsafe IS true, Supports Key Types IS true, Not empty key IS true, When Fast Hash Table, then requires no Hybrid Hash Join IS true, Small table vectorizes IS true
                              outputColumnNames: _col0, _col1, _col2, _col3
                              input vertices:
                                0 Map 1
                              Statistics: Num rows: 3 Data size: 323 Basic stats: COMPLETE Column stats: NONE
                              Reduce Output Operator
                                key expressions: _col0 (type: int)
                                sort order: +
                                Reduce Sink Vectorization:
                                    className: VectorReduceSinkOperator
                                    native: false
                                    nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true, Not ACID UPDATE or DELETE IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
                                    nativeConditionsNotMet: Uniform Hash IS false
                                Statistics: Num rows: 3 Data size: 323 Basic stats: COMPLETE Column stats: NONE
                                value expressions: _col1 (type: char(10)), _col2 (type: int), _col3 (type: char(20))
                  Execution mode: vectorized, llap
                  LLAP IO: all inputs
                  Map Vectorization:
                      enabled: true
                      enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
                      groupByVectorOutput: true
                      inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                      allNative: false
                      usesVectorUDFAdaptor: false
                      vectorized: true
              Reducer 3 
                  Execution mode: vectorized, llap
                  Reduce Vectorization:
                      enabled: true
                      enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true
                      groupByVectorOutput: true
                      allNative: false
                      usesVectorUDFAdaptor: false
                      vectorized: true
                  Reduce Operator Tree:
                    Select Operator
                      expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: char(10)), VALUE._col1 (type: int), VALUE._col2 (type: char(20))
                      outputColumnNames: _col0, _col1, _col2, _col3
                      Select Vectorization:
                          className: VectorSelectOperator
                          native: true
                          projectedOutputColumns: [0, 1, 2, 3]
                      Statistics: Num rows: 3 Data size: 323 Basic stats: COMPLETE Column stats: NONE
                      File Output Operator
                        compressed: false
                        File Sink Vectorization:
                            className: VectorFileSinkOperator
                            native: false
                        Statistics: Num rows: 3 Data size: 323 Basic stats: COMPLETE Column stats: NONE
                        table:
                            input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
            Processor Tree:
              ListSink
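
The nativeConditionsMet and nativeConditionsNotMet values are comma-separated lists, but individual conditions can contain commas themselves (for example "tez IN [tez, spark]"). A minimal Python sketch for splitting them, assuming every condition ends in "IS true" or "IS false" as in the output above; `parse_conditions` is a hypothetical helper name:

```python
import re

# A condition ends with " IS true" or " IS false". The text before it may
# itself contain commas, so split on that terminator rather than on ", ".
CONDITION = re.compile(r"(.+?) IS (true|false)(?:, |$)")


def parse_conditions(value):
    """Turn a nativeConditions* value into a list of (name, bool) pairs."""
    return [(name, flag == "true") for name, flag in CONDITION.findall(value.strip())]


# Values copied from the Map Join and Reduce Sink sections of Map 2 above.
MAP_JOIN_MET = (
    "hive.vectorized.execution.mapjoin.native.enabled IS true, "
    "hive.execution.engine tez IN [tez, spark] IS true, "
    "One MapJoin Condition IS true, No nullsafe IS true, "
    "Supports Key Types IS true, Not empty key IS true, "
    "When Fast Hash Table, then requires no Hybrid Hash Join IS true, "
    "Small table vectorizes IS true"
)
REDUCE_SINK_NOT_MET = "Uniform Hash IS false"

if __name__ == "__main__":
    for name, ok in parse_conditions(MAP_JOIN_MET) + parse_conditions(REDUCE_SINK_NOT_MET):
        print("met    " if ok else "NOT met", name)
```

In the example above, the second Reduce Sink is reported as native: false, and "Uniform Hash IS false" is the only entry listed under nativeConditionsNotMet.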
       

      EXPLAIN VECTORIZATION EXPRESSION

      Notice the predicateExpression in this example.

      PLAN VECTORIZATION:
        enabled: true
        enabledConditionsMet: [hive.vectorized.execution.enabled IS true]
      
      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 depends on stages: Stage-1
      
      STAGE PLANS:
        Stage: Stage-1
          Tez
      #### A masked pattern was here ####
            Edges:
              Reducer 2 <- Map 1 (SIMPLE_EDGE)
      #### A masked pattern was here ####
            Vertices:
              Map 1 
                  Map Operator Tree:
                      TableScan
                        alias: vector_interval_2
                        Statistics: Num rows: 2 Data size: 788 Basic stats: COMPLETE Column stats: NONE
                        TableScan Vectorization:
                            native: true
                            projectedOutputColumns: [0, 1, 2, 3, 4, 5]
                        Filter Operator
                          Filter Vectorization:
                              className: VectorFilterOperator
                              native: true
                              predicateExpression: FilterExprAndExpr(children: FilterTimestampScalarEqualTimestampColumn(val 2001-01-01 01:02:03.0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampScalarNotEqualTimestampColumn(val 2001-01-01 01:02:03.0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampScalarLessEqualTimestampColumn(val 2001-01-01 01:02:03.0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampScalarLessTimestampColumn(val 2001-01-01 01:02:03.0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampScalarGreaterEqualTimestampColumn(val 2001-01-01 01:02:03.0, col 6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampScalarGreaterTimestampColumn(val 2001-01-01 01:02:03.0, col 6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampColEqualTimestampScalar(col 6, val 2001-01-01 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampColNotEqualTimestampScalar(col 6, val 2001-01-01 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampColGreaterEqualTimestampScalar(col 6, val 2001-01-01 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampColGreaterTimestampScalar(col 6, val 2001-01-01 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampColLessEqualTimestampScalar(col 6, val 2001-01-01 01:02:03.0)(children: 
DateColSubtractIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampColLessTimestampScalar(col 6, val 2001-01-01 01:02:03.0)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampColEqualTimestampColumn(col 0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampColNotEqualTimestampColumn(col 0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampColLessEqualTimestampColumn(col 0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampColLessTimestampColumn(col 0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, FilterTimestampColGreaterEqualTimestampColumn(col 0, col 6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 6:timestamp) -> boolean, FilterTimestampColGreaterTimestampColumn(col 0, col 6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean) -> boolean
                          predicate: ((2001-01-01 01:02:03.0 = (dt + 0 01:02:03.000000000)) and (2001-01-01 01:02:03.0 <> (dt + 0 01:02:04.000000000)) and (2001-01-01 01:02:03.0 <= (dt + 0 01:02:03.000000000)) and (2001-01-01 01:02:03.0 < (dt + 0 01:02:04.000000000)) and (2001-01-01 01:02:03.0 >= (dt - 0 01:02:03.000000000)) and (2001-01-01 01:02:03.0 > (dt - 0 01:02:04.000000000)) and ((dt + 0 01:02:03.000000000) = 2001-01-01 01:02:03.0) and ((dt + 0 01:02:04.000000000) <> 2001-01-01 01:02:03.0) and ((dt + 0 01:02:03.000000000) >= 2001-01-01 01:02:03.0) and ((dt + 0 01:02:04.000000000) > 2001-01-01 01:02:03.0) and ((dt - 0 01:02:03.000000000) <= 2001-01-01 01:02:03.0) and ((dt - 0 01:02:04.000000000) < 2001-01-01 01:02:03.0) and (ts = (dt + 0 01:02:03.000000000)) and (ts <> (dt + 0 01:02:04.000000000)) and (ts <= (dt + 0 01:02:03.000000000)) and (ts < (dt + 0 01:02:04.000000000)) and (ts >= (dt - 0 01:02:03.000000000)) and (ts > (dt - 0 01:02:04.000000000))) (type: boolean)
                          Statistics: Num rows: 1 Data size: 394 Basic stats: COMPLETE Column stats: NONE
                          Select Operator
                            expressions: ts (type: timestamp)
                            outputColumnNames: _col0
                            Select Vectorization:
                                className: VectorSelectOperator
                                native: true
                                projectedOutputColumns: [0]
                            Statistics: Num rows: 1 Data size: 394 Basic stats: COMPLETE Column stats: NONE
                            Reduce Output Operator
                              key expressions: _col0 (type: timestamp)
                              sort order: +
                              Reduce Sink Vectorization:
                                  className: VectorReduceSinkOperator
                                  native: false
                                  nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true, Not ACID UPDATE or DELETE IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
                                  nativeConditionsNotMet: Uniform Hash IS false
                              Statistics: Num rows: 1 Data size: 394 Basic stats: COMPLETE Column stats: NONE
                  Execution mode: vectorized, llap
                  LLAP IO: all inputs
                  Map Vectorization:
                      enabled: true
                      enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
                      groupByVectorOutput: true
                      inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                      allNative: false
                      usesVectorUDFAdaptor: false
                      vectorized: true
              Reducer 2 
      ... 
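
A predicateExpression as long as the one above is hard to read by eye. As a rough aid, the Python sketch below counts the vectorized expression class names that appear in such a string. It assumes each expression is printed as ClassName(...), as in the example; `expression_classes` is a hypothetical helper, and the sample is a shortened excerpt with only the first two children of the FilterExprAndExpr.

```python
import re
from collections import Counter

# An expression is printed as ClassName(args...). Match an identifier that
# starts with an uppercase letter and is directly followed by "(".
CLASS_NAME = re.compile(r"\b([A-Z][A-Za-z0-9_]*)\(")


def expression_classes(predicate_expression):
    """Count how often each expression class name appears in the string."""
    return Counter(CLASS_NAME.findall(predicate_expression))


# Shortened excerpt of the predicateExpression shown above.
SAMPLE = (
    "FilterExprAndExpr(children: "
    "FilterTimestampScalarEqualTimestampColumn(val 2001-01-01 01:02:03.0, col 6)"
    "(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000)"
    " -> 6:timestamp) -> boolean, "
    "FilterTimestampScalarNotEqualTimestampColumn(val 2001-01-01 01:02:03.0, col 6)"
    "(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000)"
    " -> 6:timestamp) -> boolean) -> boolean"
)

if __name__ == "__main__":
    for name, count in expression_classes(SAMPLE).most_common():
        print(count, name)
```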
      

      The standard @Explain Annotation Type is used. A new 'vectorization' annotation marks each new class and method.

      EXPLAIN VECTORIZATION also works with FORMATTED, as the other (non-vectorization) EXPLAIN variations do.

      Attachments

        1. HIVE-11394.01.patch
          5.74 MB
          Matt McCline
        2. HIVE-11394.02.patch
          6.70 MB
          Matt McCline
        3. HIVE-11394.03.patch
          6.70 MB
          Matt McCline
        4. HIVE-11394.04.patch
          6.65 MB
          Matt McCline
        5. HIVE-11394.05.patch
          6.65 MB
          Matt McCline
        6. HIVE-11394.06.patch
          7.36 MB
          Matt McCline
        7. HIVE-11394.07.patch
          6.07 MB
          Matt McCline
        8. HIVE-11394.08.patch
          6.74 MB
          Matt McCline
        9. HIVE-11394.09.patch
          6.83 MB
          Matt McCline
        10. HIVE-11394.091.patch
          7.18 MB
          Matt McCline
        11. HIVE-11394.092.patch
          7.20 MB
          Matt McCline
        12. HIVE-11394.093.patch
          6.88 MB
          Matt McCline
        13. HIVE-11394.094.patch
          2.47 MB
          Matt McCline
        14. HIVE-11394.095.patch
          2.50 MB
          Matt McCline
        15. HIVE-11394.096.patch
          5.08 MB
          Matt McCline
        16. HIVE-11394.097.patch
          5.08 MB
          Matt McCline
        17. HIVE-11394.098.patch
          6.11 MB
          Matt McCline
        18. HIVE-11394.099.patch
          6.51 MB
          Matt McCline
        19. HIVE-11394.0991.patch
          6.53 MB
          Matt McCline
        20. HIVE-11394.0992.patch
          6.54 MB
          Matt McCline

          People

            Assignee: Matt McCline (mmccline)
            Reporter: Matt McCline (mmccline)
            Votes: 0
            Watchers: 12
