Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
0.10.0, 0.12.0
-
None
-
None
-
Tested in a linux environment
Description
Hive does not return the correct results when you run "max()" on a table that has been partitioned on a numeric column when more than 10 partitions are present and they are numbered 1 - 10. The same thing happens if the numbers are in the range of 1 - 100. It appears as if Hive is using a string-based sort algorithm instead of honoring the numeric data type and applying a true numeric sort.
The steps to recreate this issue are included below.
First, put the file below into HDFS at '/tmp/hive_bug/people_flat/people.dat'.
(Leave lines with leading dashes out)
----- File contents -----
1,David
2,Jeff
3,Cindy
4,Prakash
5,Kate
6,Chung
7,Ginny
8,Huy
9,Brett
10,Jennifer
11,Dan
12,Shivani
13,Nate
----- EOF -----
Now run these statements in either Hive or beeline:
drop table if exists people_flat;
create external table people_flat (
pnum int,
pname string
)
row format delimited fields terminated by ','
location '/tmp/hive_bug/people_flat';
select * from people_flat order by plum;
– order of records should be correct
select max(pnum) from people_flat;
– 13 (this is correct)
create table people_partitioned (
pname string
)
partitioned by (pnum int)
row format delimited fields terminated by ',';
SET hive.exec.max.dynamic.partitions=10000;
SET hive.exec.max.dynamic.partitions.pernode=10000;
SET hive.exec.dynamic.partition.mode=nonstrict;
insert into table people_partitioned
partition (pnum)
select pname,
pnum
from people_flat;
select max(pnum) from people_partitioned;
– result 9 is wrong!
select * from people_partitioned order by pnum;
– records not sorted correctly in numeric order
– Although this is not a solution, the next query yield correct results
select max(pnum_int)
from (
select cast(pnum as int) pnum_int
from people_partitioned
) a;