Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 1.11.0
Description
Nested Loop Join produces wrong results if there are multiple batches on the right side. It builds an ExpandableHyperContainer to hold all of the right-side batches. Then, for each record of the left-side input, it evaluates the join condition against every record on the right side and emits output when the condition is satisfied. The main loop inside populateOutgoingBatch calls doEval with the correct indexes to evaluate records on both sides. In the generated code of doEval, however, rightBatchIndex is for some reason right-shifted by 16 (sample shared below).
public boolean doEval(int leftIndex, int rightBatchIndex, int rightRecordIndexWithinBatch) throws SchemaChangeException {
  {
    IntHolder out3 = new IntHolder();
    {
      out3.value = vv0.getAccessor().get((leftIndex));
    }
    IntHolder out7 = new IntHolder();
    {
      out7.value = vv4[((rightBatchIndex) >>> 16)].getAccessor().get(((rightRecordIndexWithinBatch) & 65535));
    }
    ......
    ......
  }
}
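The index arithmetic used in doEval can be checked in isolation. The sketch below is only an illustration: the class name ShiftCheck and the helper compose are hypothetical, and the idea that the shift and mask were meant to decode a single packed index (batch in the upper 16 bits, record in the lower 16 bits) is an assumption, not something stated in this report.

```java
public class ShiftCheck {
    // Batch part as decoded in the generated doEval: upper 16 bits.
    static int batchOf(int index) {
        return index >>> 16;
    }

    // Record part as decoded in the generated doEval: lower 16 bits.
    static int recordOf(int index) {
        return index & 65535;
    }

    // Hypothetical helper (assumption): pack a batch index and a record
    // index into one int, which is the layout the shift/mask would decode.
    static int compose(int batch, int record) {
        return (batch << 16) | (record & 65535);
    }

    public static void main(String[] args) {
        // A plain batch index loses its value when shifted.
        System.out.println(batchOf(1));               // 0
        System.out.println(batchOf(65535));           // 0
        // Only a packed index survives the decode.
        System.out.println(batchOf(compose(1, 2)));   // 1
        System.out.println(recordOf(compose(1, 2)));  // 2
    }
}
```

Any plain batch index from 0 to 65535 decodes to batch 0, which is why the shift only makes sense if the caller passes a packed value.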
When the loop is processing the second batch, the right-shifted index inside the eval method becomes 0, so the condition ends up being evaluated against the first right batch again. As a result, whenever there is more than one batch on the right side (for any batch index up to 65535), doEval always uses the first batch for condition evaluation. The output data, however, is taken from the correct batch, so this leads to problems such as out-of-bounds errors and wrong data. Cases can be:
Let's say: rightBatchIndex is the index of the right batch to consider, and rightRecordIndexWithinBatch is the index of the record within the right batch at rightBatchIndex.
1) The first right batch arrives with zero records and with OK_NEW_SCHEMA (say, because of a filter in the operator tree). The next right batch has more than zero records. When doEval is called for the second batch (rightBatchIndex = 1) and its first record (i.e. rightRecordIndexWithinBatch = 0), the actual evaluation happens against the first batch (since rightBatchIndex >>> 16 = 0). Accessing the record at rightRecordIndexWithinBatch in the first batch throws IndexOutOfBoundsException, since the first batch has no records.
2) Let's say there are 2 batches on the right side. The first batch contains 3 records (with id_right=1/2/3) and the 2nd batch also contains 3 records (with id_right=10/20/30). There is 1 batch on the left side with 3 records (with id_left=1/2/3). In this case the NestedLoopJoin (with an equality condition) ends up producing 6 records instead of 3. It produces the first 3 records from the matches between the left records and the records of the first right batch. But while processing the 2nd right batch it evaluates id_left=id_right against the first batch instead, finds the same matches again, and produces another 3 records. Example (here every record of the second right batch has id_right=4):
Left Batch Data:
Batch 1:
{ "id_left": 1, "cost_left": 11, "name_left": "item11" }
{ "id_left": 2, "cost_left": 21, "name_left": "item21" }
{ "id_left": 3, "cost_left": 31, "name_left": "item31" }
Right Batch Data:
Batch 1:
{ "id_right": 1, "cost_right": 10, "name_right": "item1" }
{ "id_right": 2, "cost_right": 20, "name_right": "item2" }
{ "id_right": 3, "cost_right": 30, "name_right": "item3" }
Batch 2:
{ "id_right": 4, "cost_right": 40, "name_right": "item4" }
{ "id_right": 4, "cost_right": 40, "name_right": "item4" }
{ "id_right": 4, "cost_right": 40, "name_right": "item4" }
Produced output:
{ "id_left": 1, "cost_left": 11, "name_left": "item11", "id_right": 1, "cost_right": 10, "name_right": "item1" }
{ "id_left": 1, "cost_left": 11, "name_left": "item11", "id_right": 4, "cost_right": 40, "name_right": "item4" }
{ "id_left": 2, "cost_left": 21, "name_left": "item21", "id_right": 2, "cost_right": 20, "name_right": "item2" }
{ "id_left": 2, "cost_left": 21, "name_left": "item21", "id_right": 4, "cost_right": 40, "name_right": "item4" }
{ "id_left": 3, "cost_left": 31, "name_left": "item31", "id_right": 3, "cost_right": 30, "name_right": "item3" }
{ "id_left": 3, "cost_left": 31, "name_left": "item31", "id_right": 4, "cost_right": 40, "name_right": "item4" }
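The example can be reproduced outside Drill with a small simulation. This is a minimal sketch, not the operator's actual code: plain int arrays stand in for the value vectors, only the id columns are modeled, and the names NljBatchIndexDemo, evalShifted, evalDirect and join are hypothetical. It evaluates the condition once with the shifted batch index, as in the generated doEval, and once with the batch index used directly, while always taking the output from the correct batch.

```java
import java.util.ArrayList;
import java.util.List;

public class NljBatchIndexDemo {
    // id_left values of the single left batch from the example.
    static final int[] LEFT = {1, 2, 3};
    // id_right values of the two right batches from the example.
    static final int[][] RIGHT = {{1, 2, 3}, {4, 4, 4}};

    // Mirrors the generated code: the batch index is shifted, so it is always 0 here.
    static boolean evalShifted(int leftIndex, int rightBatchIndex, int rightRecordIndexWithinBatch) {
        return LEFT[leftIndex] == RIGHT[rightBatchIndex >>> 16][rightRecordIndexWithinBatch & 65535];
    }

    // Uses the batch index as passed by the loop.
    static boolean evalDirect(int leftIndex, int rightBatchIndex, int rightRecordIndexWithinBatch) {
        return LEFT[leftIndex] == RIGHT[rightBatchIndex][rightRecordIndexWithinBatch];
    }

    // Nested loop over left records, right batches and right records.
    // The emitted pair always reads from the correct right batch.
    static List<String> join(boolean shifted) {
        List<String> out = new ArrayList<>();
        for (int l = 0; l < LEFT.length; l++) {
            for (int b = 0; b < RIGHT.length; b++) {
                for (int r = 0; r < RIGHT[b].length; r++) {
                    boolean match = shifted ? evalShifted(l, b, r) : evalDirect(l, b, r);
                    if (match) {
                        out.add(LEFT[l] + "/" + RIGHT[b][r]);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(join(true));   // [1/1, 1/4, 2/2, 2/4, 3/3, 3/4]
        System.out.println(join(false));  // [1/1, 2/2, 3/3]
    }
}
```

With the shifted index the simulation yields the same six id_left/id_right pairs as the produced output above; with the direct index it yields the three expected matches. If the first right batch were an empty array, evalShifted would fail with an out-of-bounds access on the first record of the second batch, which is the analogue of case 1.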