Details
-
Bug
-
Status: Open
-
Critical
-
Resolution: Unresolved
-
Impala 4.1.0
-
None
-
None
-
ghx-label-8
Description
We found that Impala can't filter NULL values from sub-query. For example, prepare two test tables err_tbl1 and err_tbl2(id INT, dt STRING):
+----+------------+ | id | dt | +----+------------+ | 14 | 2022-11-13 | | 15 | 2022-11-13 | | 13 | 2022-11-13 | +----+------------+ +----+------------+ | id | dt | +----+------------+ | 14 | 2022-11-13 | | 16 | 2022-11-13 | | 13 | 2022-11-13 | +----+------------+
And submit query below:
SELECT * FROM ( SELECT aid, bid, COUNT(*) AS c FROM ( SELECT id AS aid FROM err_tbl1 WHERE dt = '2022-11-13' ) a FULL JOIN ( SELECT id AS bid FROM err_tbl2 WHERE dt = '2022-11-13' ) b ON a.aid = b.bid GROUP BY aid, bid ) t1 WHERE aid = bid;
Out result includes 4 rows:
+------+------+---+ | aid | bid | c | +------+------+---+ | 13 | 13 | 1 | | 14 | 14 | 1 | | NULL | 15 | 1 | | 16 | NULL | 1 | +------+------+—+
Obviously, condition of 'aid=bid' is invalid. 'NULL' value should be filtered out. But if we use condition of 'aid!=bid', there will be empty result set, which means '!=' can filter out 'NULL' value while '=' cannot. What's more, if we create a table as select * from sub-query and execute 'SELECT * FROM sub_table WHERE aid=bid'. The result is correct. If we surround 'aid=bid' with 'trim()' like 'trim(cast(aid as string))=trim(cast(bid as string))', result is also correct. In Spark, result of the same query doesn't contain 'NULL' values.