Uploaded image for project: 'Ignite'
  1. Ignite
  2. IGNITE-8732

SQL: REPLICATED cache cannot be left-joined to PARTITIONED

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.5
    • 2.10
    • sql

    Description

      Steps to reproduce

      1. Run org.apache.ignite.sqltests.ReplicatedSqlTest#testLeftJoinReplicatedPartitioned
      2. Observe that we have 2x results on 2-node cluster

      Root Cause
      left LEFT JOIN right ON cond operation assumes full scan of of a left expression. Currently we perform this scan on every node and then simply merge results on reducer. Two nodes, two scans of REPLICATED cache, 2x results.

      Potential Solutions
      We may consider several solutions. Deeper analysis is required to understand which is the right one.

      1. Perform deduplication on reducer - this most prospective and general technique, described in more details below
      2. Treat REPLICATED cache as PARTITIONED. Essentially, we just need to pass proper backup filter. But what if REPLICATED cache spans more nodes than PARTITIONED? We cannot rely on primary/backup in this case
      3. Implement additional execution phase as follows:
        SELECT left.cols, right.cols FROM left INNER JOIN right ON cond;                          // Get "inner join" part
        UNION
        UNICAST SELECT left.cols, [NULL].cols FROM left WHERE left.id NOT IN ([ids from the first phase]) // Get "outer join" part
        

      Reducer Deduplication
      The idea is to get all data locally and then perform final deduplication. This may incur high network overhead, because of lot of duplicated left parts would be transferred. However, this could be optimized greatly with the following techniques applied one after another

      1. Semi-jions: left is joined on mapper node, but instead of sending (left, right) relation, we send (left) + (right)
      2. In case left part is known to be idempotent (i.e. it produces the same result set on all nodes), only one node will send (left) + (right), other nodes will send (right) only
      3. Merge left results with if needed (i.e. if idempotence-related opto was not applicable)
      4. Join left and right parts on reducer

      UPDATE

      After a few attempts at the implementation, the solution of treating REPLICATED cache as PARTITIONED looks the most practical. The solution works in a limited case:

      • REPLICATED and PARTITIONED both have the same affinity function, number of partitions, node filter
        • Note that REPLICATED has a different number of partitions by default
      • The JOIN is done on an affinity column of both caches
        • Note that users often don’t create affinity keys for REPLICATED caches today
      • distributedJoins=false (distributed joins aren’t supported for now)

      Attachments

        Issue Links

          Activity

            People

              slukyanov Stanislav Lukyanov
              vozerov Vladimir Ozerov
              Taras Ledkov Taras Ledkov
              Votes:
              2 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h
                  1h