[HIVE-12762] Common join on parquet tables returns incorrect result when hive.optimize.index.filter set to true - ASF JIRA

Log work

Agile Board

Rank to Top

Rank to Bottom

Bulk Copy Attachments

Bulk Move Attachments

Voters

Watch issue

Watchers

Create sub-task

Convert to sub-task

Move

Link

Clone

Labels

Update Comment Author

Replace String in Comment

Update Comment Visibility

Delete Comments

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0, 2.1.0
Fix Version/s: 2.0.0
Component/s: Logical Optimizer
Labels:
None

Description

The following query will give incorrect result.

CREATE TABLE tbl1(id INT) STORED AS PARQUET;
INSERT INTO tbl1 VALUES(1), (2);

CREATE TABLE tbl2(id INT, value STRING) STORED AS PARQUET;
INSERT INTO tbl2 VALUES(1, 'value1');
INSERT INTO tbl2 VALUES(1, 'value2');

set hive.optimize.index.filter = true;
set hive.auto.convert.join=false;
select tbl1.id, t1.value, t2.value
FROM tbl1
JOIN (SELECT * FROM tbl2 WHERE value='value1') t1 ON tbl1.id=t1.id
JOIN (SELECT * FROM tbl2 WHERE value='value2') t2 ON tbl1.id=t2.id;

We are enforcing to use common join and tbl2 will have 2 files after 2 insertions underneath.

the map job contains 3 TableScan operators (2 for tbl2 and 1 for tbl1). When hive.optimize.index.filter is set to true, we are incorrectly applying the later filtering condition to each block, which causes no data is returned for the subquery SELECT * FROM tbl2 WHERE value='value1'.