Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
2.5
Description
Steps to reproduce
- Run org.apache.ignite.sqltests.ReplicatedSqlTest#testLeftJoinReplicatedPartitioned
- Observe that we have 2x results on 2-node cluster
Root Cause
left LEFT JOIN right ON cond operation assumes full scan of of a left expression. Currently we perform this scan on every node and then simply merge results on reducer. Two nodes, two scans of REPLICATED cache, 2x results.
Potential Solutions
We may consider several solutions. Deeper analysis is required to understand which is the right one.
- Perform deduplication on reducer - this most prospective and general technique, described in more details below
- Treat REPLICATED cache as PARTITIONED. Essentially, we just need to pass proper backup filter. But what if REPLICATED cache spans more nodes than PARTITIONED? We cannot rely on primary/backup in this case
- Implement additional execution phase as follows:
SELECT left.cols, right.cols FROM left INNER JOIN right ON cond; // Get "inner join" part UNION UNICAST SELECT left.cols, [NULL].cols FROM left WHERE left.id NOT IN ([ids from the first phase]) // Get "outer join" part
Reducer Deduplication
The idea is to get all data locally and then perform final deduplication. This may incur high network overhead, because of lot of duplicated left parts would be transferred. However, this could be optimized greatly with the following techniques applied one after another
- Semi-jions: left is joined on mapper node, but instead of sending (left, right) relation, we send (left) + (right)
- In case left part is known to be idempotent (i.e. it produces the same result set on all nodes), only one node will send (left) + (right), other nodes will send (right) only
- Merge left results with if needed (i.e. if idempotence-related opto was not applicable)
- Join left and right parts on reducer
UPDATE
After a few attempts at the implementation, the solution of treating REPLICATED cache as PARTITIONED looks the most practical. The solution works in a limited case:
- REPLICATED and PARTITIONED both have the same affinity function, number of partitions, node filter
-
- Note that REPLICATED has a different number of partitions by default
- The JOIN is done on an affinity column of both caches
-
- Note that users often don’t create affinity keys for REPLICATED caches today
- distributedJoins=false (distributed joins aren’t supported for now)
Attachments
Issue Links
- is duplicated by
-
IGNITE-5016 SQL: Support LEFT JOIN from replicated table to a partitioned.
- Resolved
- relates to
-
IGNITE-13975 Validate that tables beind joined have the same partition distribution
- Open
- links to