[HIVE-26452] NPE when converting join to mapjoin and join column referenced more than once - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- pull-request-available

Target Version/s: 4.0.0

Description

explain
select count(*)
from LU_CUSTOMER pa11
  join ORDER_FACT a15
    on (pa11.CUSTOMER_ID = a15.CUSTOMER_ID)
  join LU_CUSTOMER a16
    on (a15.CUSTOMER_ID = a16.CUSTOMER_ID and pa11.CUSTOMER_ID = a16.CUSTOMER_ID);

a16.CUSTOMER_ID is referenced more than once in the join condition.

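Hive generates ReduceSink (RS) operators for the join's inputs. The row schema of one of these RS operators contains only one instance of the join key (customer_id), even though two key columns are mapped to it, as the debugger output for RS[13] shows: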

RS[13]
result = {HashMap@16092}  size = 2
 "KEY.reducesinkkey0" -> {ExprNodeColumnDesc@16083} "Column[_col0]"
 "KEY.reducesinkkey1" -> {ExprNodeColumnDesc@16102} "Column[_col0]"

result = {RowSchema@16104} "(KEY.reducesinkkey0: int|{$hdt$_2}customer_id)"
 signature = {ArrayList@16110}  size = 1
  0 = {ColumnInfo@16087} "KEY.reducesinkkey0: int"

KEY.reducesinkkey1 is missing from the schema.

When the join is converted to a mapjoin, the conversion algorithm tries to look up both join key column instances. The lookup of the missing instance fails, which results in the NPE:

https://github.com/apache/hive/blob/2aaba3c79e740ef27fc263b5a8ff33ad679c5a12/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java#L538
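
The mismatch described above can be reproduced in isolation. The following is a minimal standalone sketch, not Hive code: the class SchemaLookupSketch, the nested Column type, and all method names are hypothetical stand-ins, and the lookup logic is an assumption about the general failure mode rather than a copy of the code at the linked line. It builds a two-entry key map and a one-entry schema like the ones in the debugger output, then shows that an unchecked lookup of the second key dereferences null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Standalone illustration; none of these types are Hive classes.
public class SchemaLookupSketch {

    // Hypothetical stand-in for one column entry of a row schema.
    public static final class Column {
        final String name;
        final String type;

        Column(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    // Mirrors the map in the debugger output: two keys, same source column.
    public static Map<String, String> exampleExprMap() {
        Map<String, String> exprMap = new LinkedHashMap<>();
        exprMap.put("KEY.reducesinkkey0", "Column[_col0]");
        exprMap.put("KEY.reducesinkkey1", "Column[_col0]");
        return exprMap;
    }

    // Mirrors the row schema in the debugger output: only one key column.
    public static List<Column> exampleSignature() {
        List<Column> signature = new ArrayList<>();
        signature.add(new Column("KEY.reducesinkkey0", "int"));
        return signature;
    }

    // Returns the schema column with the given name, or null if it is absent.
    public static Column lookup(List<Column> signature, String name) {
        for (Column c : signature) {
            if (c.name.equals(name)) {
                return c;
            }
        }
        return null;
    }

    // Lists the keys that are mapped to an expression but missing from the schema.
    public static List<String> missingFromSchema(Map<String, String> exprMap,
                                                 List<Column> signature) {
        List<String> missing = new ArrayList<>();
        for (String key : exprMap.keySet()) {
            if (lookup(signature, key) == null) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, String> exprMap = exampleExprMap();
        List<Column> signature = exampleSignature();
        System.out.println("Missing from schema: " + missingFromSchema(exprMap, signature));

        try {
            // Unchecked dereference of the lookup result (assumed failure mode).
            String type = lookup(signature, "KEY.reducesinkkey1").type;
            System.out.println(type);
        } catch (NullPointerException e) {
            System.out.println("NPE on unchecked lookup of KEY.reducesinkkey1");
        }
    }
}
```

A check like missingFromSchema makes the inconsistency visible before any dereference. Whether the actual fix adds the missing key to the schema or guards the lookup is not stated in this description.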