Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
Impala 2.5.0
Description
The aggregate function count is not working properly on those table columns that contain '' (blank) values only. It returns incorrect values seemingly handleing it as NULLs. The issue is reproducible in multiple environments (CDH 5.5.2, 5.7.2) and it seems it is Impala and RCFile specific:
[morhidi-552-sec-2.gce.cloudera.com:21000] > select * from impala_rcfile_count_issue; Query: select * from impala_rcfile_count_issue +------+------+------+ | col1 | col2 | col3 | +------+------+------+ | 1 | | a | | | | b | | 2 | | c | +------+------+------+ Fetched 3 row(s) in 0.22s [morhidi-552-sec-2.gce.cloudera.com:21000] > select * from impala_rcfile_count_issue where col1=''; Query: select * from impala_rcfile_count_issue where col1='' +------+------+------+ | col1 | col2 | col3 | +------+------+------+ | | | b | +------+------+------+ Fetched 1 row(s) in 0.12s [morhidi-552-sec-2.gce.cloudera.com:21000] > select count(*) from impala_rcfile_count_issue where col1=''; Query: select count(*) from impala_rcfile_count_issue where col1='' +----------+ | count(*) | +----------+ | 1 | +----------+ Fetched 1 row(s) in 0.67s [morhidi-552-sec-2.gce.cloudera.com:21000] > select * from impala_rcfile_count_issue where col2=''; Query: select * from impala_rcfile_count_issue where col2='' +------+------+------+ | col1 | col2 | col3 | +------+------+------+ | 1 | | a | | | | b | | 2 | | c | +------+------+------+ Fetched 3 row(s) in 0.12s INCORRECT: [morhidi-552-sec-2.gce.cloudera.com:21000] > select count(*) from impala_rcfile_count_issue where col2=''; Query: select count(*) from impala_rcfile_count_issue where col2='' +----------+ | count(*) | +----------+ | 0 | +----------+ Fetched 1 row(s) in 0.76s [morhidi-552-sec-2.gce.cloudera.com:21000] > select * from impala_rcfile_count_issue where col1 is NULL; Query: select * from impala_rcfile_count_issue where col1 is NULL Fetched 0 row(s) in 0.22s [morhidi-552-sec-2.gce.cloudera.com:21000] > select count(*) from impala_rcfile_count_issue where col1 is NULL; Query: select count(*) from impala_rcfile_count_issue where col1 is NULL +----------+ | count(*) | +----------+ | 0 | +----------+ Fetched 1 row(s) in 0.68s [morhidi-552-sec-2.gce.cloudera.com:21000] > select * from impala_rcfile_count_issue where col2 is NULL; Query: select * from impala_rcfile_count_issue where col2 is NULL Fetched 0 row(s) in 0.12s INCORRECT: [morhidi-552-sec-2.gce.cloudera.com:21000] > select count(*) from impala_rcfile_count_issue where col2 is NULL; Query: select count(*) from impala_rcfile_count_issue where col2 is NULL +----------+ | count(*) | +----------+ | 3 | +----------+ Fetched 1 row(s) in 0.64s
Here is the command to create the table from Hive:
create table impala_rcfile_count_issue stored as rcfile as select * from ( select '1' as col1,'' as col2, 'a' as col3 union all select '' as col1,'' as col2, 'b' as col3 union all select '2' as col1,'' as col2, 'c' as col3 ) t;