Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
3.1.0, 3.1.2
-
None
Description
We have a query which produces duplicate results even when there is no duplicate records in underlying tables.
Sample Query
select * from orderdatatest_ext order by col1 limit 1000,50
The problem appears when order by clause is used with col1 having non-unique rows. Apparently the duplicates are being produced during reducer phase of the query.
set hive.vectorized.execution.reduce.enabled=false does not cause the problem.
Data in table is as follows.
1,1 1,2 1,3 . . 1,1500
Results with hive.vectorized.execution.reduce.enabled=true
+-------------------------+-------------------------+ | orderdatatest_ext.col1 | orderdatatest_ext.col2 | +-------------------------+-------------------------+ | 1 | 1001 | | 1 | 1002 | | 1 | 1003 | | 1 | 1004 | | 1 | 1005 | | 1 | 1006 | | 1 | 1007 | | 1 | 1008 | | 1 | 1009 | | 1 | 1010 | | 1 | 1011 | | 1 | 1012 | | 1 | 1013 | | 1 | 1014 | | 1 | 1015 | | 1 | 1016 | | 1 | 1017 | | 1 | 1018 | | 1 | 1019 | | 1 | 1020 | | 1 | 1021 | | 1 | 1022 | | 1 | 1023 | | 1 | 1024 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | | 1 | 1 | +-------------------------+-------------------------+
Results with hive.vectorized.execution.reduce.enabled=false
+-------------------------+-------------------------+ | orderdatatest_ext.col1 | orderdatatest_ext.col2 | +-------------------------+-------------------------+ | 1 | 1001 | | 1 | 1002 | | 1 | 1003 | | 1 | 1004 | | 1 | 1005 | | 1 | 1006 | | 1 | 1007 | | 1 | 1008 | | 1 | 1009 | | 1 | 1010 | | 1 | 1011 | | 1 | 1012 | | 1 | 1013 | | 1 | 1014 | | 1 | 1015 | | 1 | 1016 | | 1 | 1017 | | 1 | 1018 | | 1 | 1019 | | 1 | 1020 | | 1 | 1021 | | 1 | 1022 | | 1 | 1023 | | 1 | 1024 | | 1 | 1025 | | 1 | 1026 | | 1 | 1027 | | 1 | 1028 | | 1 | 1029 | | 1 | 1030 | | 1 | 1031 | | 1 | 1032 | | 1 | 1033 | | 1 | 1034 | | 1 | 1035 | | 1 | 1036 | | 1 | 1037 | | 1 | 1038 | | 1 | 1039 | | 1 | 1040 | | 1 | 1041 | | 1 | 1042 | | 1 | 1043 | | 1 | 1044 | | 1 | 1045 | | 1 | 1046 | | 1 | 1047 | | 1 | 1048 | | 1 | 1049 | | 1 | 1050 | +-------------------------+-------------------------+
Table DDL
CREATE EXTERNAL TABLE orderdatatest_ext (col1 int, col2 int) stored as orc
Attached sample ORC file.
Problem appears to be with VectorLimitOperator.
2020-04-20 15:35:49,693 [INFO] [TezChild] |vector.VectorSelectOperator|: Initializing operator SEL[6] 2020-04-20 15:35:49,747 [INFO] [TezChild] |vector.VectorSelectOperator|: RECORDS_OUT_INTERMEDIATE_Map_1:0, RECORDS_OUT_OPERATOR_SEL_6:1500, 2020-04-20 15:35:50,142 [INFO] [TezChild] |vector.VectorSelectOperator|: Initializing operator SEL[8] 2020-04-20 15:35:50,303 [INFO] [TezChild] |vector.VectorSelectOperator|: RECORDS_OUT_OPERATOR_SEL_8:1050, RECORDS_OUT_INTERMEDIATE_Reducer_2:0, 2020-04-20 15:35:50,142 [INFO] [TezChild] |vector.VectorLimitOperator|: Initializing operator LIM[9] 2020-04-20 15:35:50,303 [INFO] [TezChild] |vector.VectorLimitOperator|: RECORDS_OUT_INTERMEDIATE_Reducer_2:0, RECORDS_OUT_OPERATOR_LIM_9:1050,