Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
3.1.1
-
Hive-3, JDK-8
Description
with vectorization turned on and hive.vectorized.adaptor.usage.mode=all we were seeing severe performance degradation. looking at the task jstacks it seems that it is running the code which vectorizes UDF and stuck in some loop.
jstack -l 14954 | grep 0x3af0 -A20 "TezChild" #15 daemon prio=5 os_prio=0 tid=0x00007f157538d800 nid=0x3af0 runnable [0x00007f1547581000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:573) at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:350) at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:205) at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:150) at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271) at org.apache.hadoop.hive.ql.exec.vector.expressions.ListIndexColScalar.evaluate(ListIndexColScalar.java:59) at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:146) at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:965) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:938) at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125) at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:889) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76) at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:426) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61) [yarn@hdp32b ~]$ jstack -l 14954 | grep 0x3af0 -A20 "TezChild" #15 daemon prio=5 os_prio=0 tid=0x00007f157538d800 nid=0x3af0 runnable [0x00007f1547581000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector.ensureSize(BytesColumnVector.java:554) at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:570) at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:350) at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:205) at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:150) at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271) at org.apache.hadoop.hive.ql.exec.vector.expressions.ListIndexColScalar.evaluate(ListIndexColScalar.java:59) at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:146) at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:965) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:938) at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125) at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:889) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76) at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:426) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
after setting the hive.vectorized.adaptor.usage.mode=none query did complete much faster.
Steps To Reproduce:
1. Create Table:
+----------------------------------------------------+ | createtab_stmt | +----------------------------------------------------+ | CREATE EXTERNAL TABLE `splittestloc`( | | `id` int, | | `value` string) | | ROW FORMAT SERDE | | 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' | | WITH SERDEPROPERTIES ( | | 'field.delim'=',', | | 'serialization.format'=',') | | STORED AS INPUTFORMAT | | 'org.apache.hadoop.mapred.TextInputFormat' | | OUTPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' | | LOCATION | | 'hdfs://hdp31a.hdp.local:8020/tmp/splittableloc' | | TBLPROPERTIES ( | | 'bucketing_version'='2', | | 'transient_lastDdlTime'='1561482451') | +----------------------------------------------------+
2. Sample data: table has some 40M rows and sample data is generated using following script.
for i in {1..40000000} ; do echo $i,"start#mid#"$i >> data.log ; done
3. I believe this should be reproducible with hive generic split but I am attaching the custom UDF to split the string.
4. create a function
add jar /tmp/CustomSplit-1.0-SNAPSHOT.jar;
create temporary function mysplit as 'com.rajkrrsingh.split.test.CustomSplit'
5. run the following query which will reproduce the issue if vectorization turned on.
create temporary table tmp2 as select id,mysplit(value,"#")[2] from splittestloc
Attachments
Attachments
Issue Links
- is related to
-
HIVE-27412 Vectorization: VectorMapOperator does not reset VectorizedRowBatch after processing
- Resolved
- links to