HIVE-21935

Hive Vectorization: degraded performance with vectorized UDF

Details

    Description

      With vectorization turned on and hive.vectorized.adaptor.usage.mode=all, we saw severe performance degradation. Judging by jstack dumps of the task, it is running the code that vectorizes the UDF (VectorUDFAdaptor) and appears to be stuck in a loop there.

      jstack -l 14954 | grep 0x3af0 -A20
      "TezChild" #15 daemon prio=5 os_prio=0 tid=0x00007f157538d800 nid=0x3af0 runnable [0x00007f1547581000]
         java.lang.Thread.State: RUNNABLE
      	at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:573)
      	at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:350)
      	at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:205)
      	at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:150)
      	at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
      	at org.apache.hadoop.hive.ql.exec.vector.expressions.ListIndexColScalar.evaluate(ListIndexColScalar.java:59)
      	at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:146)
      	at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:965)
      	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:938)
      	at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125)
      	at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:889)
      	at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92)
      	at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76)
      	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:426)
      	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
      	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
      	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
      	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
      	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
      [yarn@hdp32b ~]$ jstack -l 14954 | grep 0x3af0 -A20
      "TezChild" #15 daemon prio=5 os_prio=0 tid=0x00007f157538d800 nid=0x3af0 runnable [0x00007f1547581000]
         java.lang.Thread.State: RUNNABLE
      	at org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector.ensureSize(BytesColumnVector.java:554)
      	at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:570)
      	at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:350)
      	at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:205)
      	at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:150)
      	at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
      	at org.apache.hadoop.hive.ql.exec.vector.expressions.ListIndexColScalar.evaluate(ListIndexColScalar.java:59)
      	at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:146)
      	at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:965)
      	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:938)
      	at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125)
      	at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:889)
      	at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92)
      	at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76)
      	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:426)
      	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
      	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
      	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
      	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
      
      

      After setting hive.vectorized.adaptor.usage.mode=none, the query completed much faster.

      Steps To Reproduce:
      1. Create Table:

      CREATE EXTERNAL TABLE `splittestloc`(
        `id` int,
        `value` string)
      ROW FORMAT SERDE
        'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      WITH SERDEPROPERTIES (
        'field.delim'=',',
        'serialization.format'=',')
      STORED AS INPUTFORMAT
        'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT
        'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
      LOCATION
        'hdfs://hdp31a.hdp.local:8020/tmp/splittableloc'
      TBLPROPERTIES (
        'bucketing_version'='2',
        'transient_lastDdlTime'='1561482451');
      

      2. Sample data: the table has about 40M rows, generated with the following script.

      for i in {1..40000000} ; do echo $i,"start#mid#"$i >> data.log ; done
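
      The loop above expands the whole 1..40000000 range in memory and reopens data.log for every row, so it can take a long time. A possibly faster variant that produces the same row format is sketched below; the helper name gen_rows is an assumption for illustration and is not part of the original report.

```shell
#!/usr/bin/env bash
# Sketch: generate rows of the form "<id>,start#mid#<id>" for the repro table.
# gen_rows is a hypothetical helper name, not part of the original report.
gen_rows() {
  local n="$1"
  # seq streams the ids one by one, and awk writes to stdout,
  # so the output file is opened only once by the caller's redirect.
  seq 1 "$n" | awk '{ print $1 ",start#mid#" $1 }'
}

# Small run for a quick check; pass 40000000 to match the row count in the report.
gen_rows 5 > data.log
```

      The resulting file still has to be placed under the table's LOCATION, for example with hdfs dfs -put data.log /tmp/splittableloc/.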
      

      3. I believe this should also be reproducible with Hive's built-in split UDF, but I am attaching the custom UDF that was used to split the string.
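
      For orientation, the snippet below mimics in bash what the split is expected to return for one sample row. It is only an illustration: the source of CustomSplit is not included in this report, so the sketch assumes it behaves like a plain delimiter split. Hive array indexes are zero-based, so index [2] in the query of step 5 selects the third piece.

```shell
#!/usr/bin/env bash
# Illustration only: mimic split(value, "#")[2] on one sample row in bash.
# Assumes CustomSplit behaves like a plain split on the delimiter.
row='7,start#mid#7'
value="${row#*,}"                      # drop the id column, leaving start#mid#7
IFS='#' read -r -a parts <<< "$value"  # parts=(start mid 7)
third="${parts[2]}"                    # zero-based index 2, as in the Hive query
echo "$third"                          # prints 7
```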

      4. Create the function:

      add jar /tmp/CustomSplit-1.0-SNAPSHOT.jar;
      create temporary function mysplit as 'com.rajkrrsingh.split.test.CustomSplit';
      

      5. Run the following query, which reproduces the issue when vectorization is turned on:

      create temporary table tmp2 as select id,mysplit(value,"#")[2] from splittestloc;
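
      To compare the two adaptor modes side by side, steps 4 and 5 can be collected into a script file and run once per mode. The sketch below is one way to do that, not the procedure used in the original report: the file name repro.hql and the HIVE_JDBC_URL environment variable are assumptions, and the timing runs are skipped when beeline or the URL is not available.

```shell
#!/usr/bin/env bash
# Sketch: write the statements from steps 4 and 5 to a script file.
# repro.hql is a hypothetical file name chosen for this example.
cat > repro.hql <<'EOF'
set hive.vectorized.execution.enabled=true;
add jar /tmp/CustomSplit-1.0-SNAPSHOT.jar;
create temporary function mysplit as 'com.rajkrrsingh.split.test.CustomSplit';
create temporary table tmp2 as select id,mysplit(value,"#")[2] from splittestloc;
EOF

# HIVE_JDBC_URL is a hypothetical variable holding the HiveServer2 JDBC URL.
# Each beeline call is a separate session, so the temporary table and
# function from one run do not clash with the next run.
if [ -n "${HIVE_JDBC_URL:-}" ] && command -v beeline >/dev/null 2>&1; then
  for mode in all none; do
    echo "hive.vectorized.adaptor.usage.mode=$mode"
    time beeline -u "$HIVE_JDBC_URL" \
      --hiveconf hive.vectorized.adaptor.usage.mode="$mode" \
      -f repro.hql
  done
fi
```

      If the behaviour described above holds, the run with mode=all should take noticeably longer than the run with mode=none.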
      

      Attachments

        1. CustomSplit-1.0-SNAPSHOT.jar
          5 kB
          Rajkumar Singh

            People

              mustafaiman Mustafa İman
              Rajkumar Singh Rajkumar Singh
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m