
PHOENIX-852: Optimize child/parent foreign key joins

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2.0, 3.2.0
    • Labels:
      None

      Description

      Oftentimes a join will occur from a child to a parent. Our current algorithm would do a full scan of one side or the other. We can do much better than that if the HashCache contains the PK (or even part of the PK) of the table being joined to. In these cases, we should drive the second scan through a skip scan on the server side.
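      As a rough illustration (a sketch only, with hypothetical tables A and B; the query shape is the one discussed in the comments below), this is the kind of query the optimization targets:

          -- A's primary key is the join key; B is the smaller/filtered side that gets
          -- built into the HashCache.
          CREATE TABLE A (id BIGINT NOT NULL PRIMARY KEY, val VARCHAR);
          CREATE TABLE B (id BIGINT NOT NULL, ord INTEGER NOT NULL, b VARCHAR,
                          CONSTRAINT pk PRIMARY KEY (id, ord));

          -- Because the join key A.id is A's full PK, the B.id values gathered for the
          -- HashCache could seed a skip scan over A instead of a full scan of A.
          SELECT * FROM A JOIN B ON A.id = B.id WHERE B.b = 'x';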

      Attachments

      1. 852.patch
        45 kB
        Maryann Xue
      2. 852-2.patch
        48 kB
        Maryann Xue
      3. 852-3.patch
        83 kB
        Maryann Xue
      4. 852-4.patch
        83 kB
        Maryann Xue
      5. PHOENIX-852.patch
        41 kB
        Maryann Xue

        Issue Links

          Activity

          jamestaylor James Taylor added a comment -

          Another, perhaps better approach, would be to recognize this case at compile time and generate a different plan that does a RHS query with an IN clause to drive the skip scan. This would then get pushed down to all the region servers as a skip scan.
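          A rough sketch of that idea (hypothetical tables A and B as in the description; the IN list would be built at run time from the other side's rows):

              -- Original join; B is scanned and hashed first.
              SELECT * FROM A JOIN B ON A.id = B.id WHERE B.b = 'x';

              -- Generated driving query over A: the IN list compiles down to point
              -- lookups, and the skip scan gets pushed to every region server.
              SELECT * FROM A WHERE A.id IN (?, ?, ?);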

          jamestaylor James Taylor added a comment -

          I think it'd be easiest to just take the row key columns from the results of the smaller table, and use them to generate a query with an IN clause. You can use our row value constructor syntax for this. See RowValueConstructorIT.testQueryMoreWithInListRowValueConstructor() for an example. You wouldn't even send a hash cache to the region servers. Just let the IN query go through the normal path, as it's executed pretty optimally already. I think you'd need to generate a different query plan altogether for this, though.
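          For reference, the row value constructor IN form mentioned here looks roughly like this (hypothetical PK columns pk1, pk2; the bind values would be the row key columns taken from the smaller table's results):

              SELECT * FROM A WHERE (pk1, pk2) IN ((?, ?), (?, ?), (?, ?));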

          maryannxue Maryann Xue added a comment -

          We now need to introduce something like dynamic compilation – compilation that happens in the middle of an execution. And it is going to be used by "IN / NOT IN" sub-queries as well. I am thinking if it is possible to merge scan ranges so that we can avoid compiling those static where conditions but only compile those dynamic ones?

          Do we need a size limit for such optimizations? The skip scan range list might be too long, or the InListExpression's hash set could be too big, and unlike the hash cache it escapes the memory manager.

          However the hash cache thing in join queries is inevitable I think, for there could be 1-to-many mappings between two tables, which means the join semantics are not fully equivalent to an IN clause.

          jamestaylor James Taylor added a comment -

          I am thinking if it is possible to merge scan ranges so that we can avoid compiling those static where conditions but only compile those dynamic ones?

          Eventually we should support caching query plans and being able to merge them together, but I don't think this is a very expensive operation. Let's wait on this and see where the cost is. We could come in at a slightly lower level for this, bypassing the parser, and just generating an InListExpression directly.

          Do we need a size limit for such optimizations?

          I don't think this will be necessary, as an IN clause compiles down to a set of keys. We've tested it with 250K keys and it was very fast.

          However the hash cache thing in join queries is inevitable I think, for there could be 1-to-many mappings between two tables,

          The IN construct can handle partial key matches, so I think it'd work fine. There's a little bit of work to pass the number of PK slots a key encompasses over to the skip scan (for example, if the RHS ends up contributing only the leading 3 of 4 PK columns), but the underlying engine will handle this fine.

          The nice thing about this approach is you'll be leveraging the way we optimize these IN expressions. The skip scan will just skip from row key to row key and be so much faster than a full table scan. It'll be a huge speedup for a relatively common case.

          maryannxue Maryann Xue added a comment -

          We could come in at a slightly lower level for this, bypassing the parser, and just generating an InListExpression directly.

          Sure, we are gonna do that. It's just that the re-compilation of the other where clauses (aside from this IN clause) seems not so neat to me. But I'll just do it this way for now, since the overhead is not really gonna be a problem.

          There is no doubt that the IN construct can handle the key mapping, but what I'm saying is that it is not sufficient in some cases. Suppose we have left table tuples (a, 1), (c, 2) and right table tuples (a, 3), (c, 4) and we perform a join on the first column but only select those columns from the left table. In this case, we can simply use the IN construct and we don't need that hash cache. But imagine we have another right table tuple (a, 5), the result should now be (a, 1), (a, 1), (c, 2), for there are two tuples matching "a" from the right table. In this latter case, we still have to keep the hash cache.

          The nice thing about this approach is you'll be leveraging the way we optimize these IN expressions. The skip scan will just skip from row key to row key and be so much faster than a full table scan. It'll be a huge speedup for a relatively common case.

          100% agree. PHOENIX-889 is a very good example.

          jamestaylor James Taylor added a comment -

          With RHS = (a,5) and LHS = (a,1),(a,1),(c,2), a query with WHERE pk1 IN (a) will return two rows from the LHS table (assuming the LHS has a composite primary key constraint of more than just one column, as otherwise there'd be just one row with a PK of a). Is that what we'd want?

          maryannxue Maryann Xue added a comment -

          Yes. Let me make it clearer:
          LHS=(a, 1),(c, 2); Schema: (col0, col1) pk (col0).
          RHS=(a, 3),(a, 5),(c, 4); Schema: (col0, col1) pk (col0, col1).
          With "select LHS.col0, LHS.col1 from LHS inner join RHS on LHS.col0 = RHS.col0", we should get (a,1),(a,1),(c,2)
          With "select col0, col1 from LHS where col0 in (select col0 from RHS)", we will get (a,1),(c,2)

          jamestaylor James Taylor added a comment -

          I can see that the correlated subquery case will be different than the join case, but let's ignore that one for now (as it's an issue that needs to be solved when we support correlated subqueries regardless of the approach here).

          So the IN query will do what you want, but I agree, you'd need to coordinate the LHS and RHS iterators carefully. For example, you'd need to not do a next on the RHS iterator until the LHS iterator returns a different value for the join key. You might be able to order the RHS by the join key and work out this coordination as you iterate, though doing this in a parallel manner would get tricky. Or maybe you could create the hash cache like today and then use it on the client side.

          If you think it's easier/better to push the hash cache to all region servers like we do for the standard join case, extract the relevant keys on the server side, and add a skip scan filter to the RHS scan on the server side, that's fine too. You'll be pushing more information than you need to the region servers, though, as we could slice this up and only send what's required for each region (that's what we do when we process an IN).

          So it seems like there are a lot of different options. I suspect it'll become clearer which is best as you get into it. Let me know how I can help.

          maryannxue Maryann Xue added a comment -

          The approach I'm gonna take for now for the join optimization is to construct an IN clause and use it to generate a skip scan filter, but not as a hash set filter. In other words, I'm gonna let the IN clause go through WhereOptimizer only, and not through the later stages of WhereCompiler. So if the join expression contains some key part, a skip scan will be generated to avoid a full scan, while the real join work will still be exactly the same process as it is now.
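          In effect (a sketch, reusing the hypothetical A/B tables):

              -- User query:
              SELECT * FROM A JOIN B ON A.id = B.id WHERE B.b = 'x';

              -- Conceptually, the LHS scan is narrowed as though the query carried
              --   A.id IN (<ids collected while building the hash cache from B>)
              -- but that IN list only shapes the scan ranges (skip scan); it is not
              -- re-evaluated as a filter, and the hash join still joins the rows.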

          jamestaylor James Taylor added a comment -

          +1. Nice approach. Note that in the case of a partial match (say 2 of 3 PK columns), there'll be a little bit of work required in the WhereOptimizer and SkipScanFilter to handle this correctly.

          maryannxue Maryann Xue added a comment - - edited

          Yes... I guess I'm a little confused about the handling of PK columns in WhereOptimizer. Is it that currently we can only handle cases that satisfy 1) PK positions are consecutive and 2) PK positions start from 0 (based on pk-offset)?

          So before any improvement has been made to WhereOptimizer and SkipScanFilter, we only find those join conditions that together form the leading PK columns and optimize them? For example, table A's PK is (c0,c1,c2), and our join conditions on table A have c0 and c2. We simply include c0 in the skip scan optimization.

          jamestaylor James Taylor added a comment -

          Yes, correct - they must be leading columns. There's a little bit of work required to support the case of the join condition including both c0 and c1. We handle the basic case (i.e. WHERE c0=1 AND c1=2), but not the IN case (i.e. WHERE (c0, c1) IN ((?,?),(?,?)) ). Your case is really the latter. This would be easy to add, though, and I think well worth it.

          FWIW, we can handle non-leading columns or gaps in columns, but we don't by default today. The reason is that we don't know the cardinality of these missing columns, so we don't know if doing a skip scan would be better or worse than not doing one. When we start collecting histogram information, we can start to change this.
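          Concretely (hypothetical table A with PK (c0, c1, c2)), the distinction is roughly:

              -- Handled today: leading PK columns with simple equality
              SELECT * FROM A WHERE c0 = 1 AND c1 = 2;

              -- Not yet handled at this point: a row value constructor IN over a leading
              -- partial key, which is what the join rewrite produces
              SELECT * FROM A WHERE (c0, c1) IN ((1, 2), (3, 4));

              -- Not optimized by default: a gap in the key (c2 without c1), pending stats
              -- to judge whether a skip scan would pay off
              SELECT * FROM A WHERE c0 = 1 AND c2 = 5;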

          maryannxue Maryann Xue added a comment -

          I see. So the code below is what currently handles a partial match for the IN clause, right?

                      // We can only optimize a row value constructor that is fully qualified
                      if (childSlot.getPKSpan() > 1 && !isFullyQualified(childSlot.getPKSpan())) {
                          // Just return a key part that has the min/max of the IN list, but doesn't
                          // extract the IN list expression.
                          return newKeyParts(childSlot, (Expression)null, Collections.singletonList(
                                  KeyRange.getKeyRange(
                                          ByteUtil.copyKeyBytesIfNecessary(node.getMinKey()), true,
                                          ByteUtil.copyKeyBytesIfNecessary(node.getMaxKey()), true)), null);
                      }
          
          maryannxue Maryann Xue added a comment -

          I will open another issue to cover this improvement.

          jamestaylor James Taylor added a comment -

          Yes, that's the code. +1 for new JIRA. We allow this slotSpan array to be passed through to the skip scan now. If we have a case like WHERE (c0, c1) IN ((?,?),(?,?)), then the row keys will span two slots instead of one. So in this case the slotSpan array would be [1], as it defines the number of columns that the row key covers minus one.

          maryannxue Maryann Xue added a comment -

          James Taylor Just realized one thing: some child-to-parent joins will cover a full range of PKs, for example, an employee basic info table joining an employee payroll table on employee_id. Thus, this optimization would just add extra overhead. So I'd like to add a switch, and I'm thinking of two options:
          1. a hint
          2. a limit on the size of the value set: say, if there are too many point look-ups, we simply do a full scan.

          What do you think? Either of them or both? If (2), what default limit value would be reasonable?

          jamestaylor James Taylor added a comment -

          For now maybe a hint is enough to disable this optimization. For (2), we really should drive off of stats when we have them because, as you've alluded to, it's tough to have a meaningful default.

          maryannxue Maryann Xue added a comment -

          How about applying some "smart" decision (like enabling this optimization if the right table has a where clause), and meanwhile having two hints: force-join-scan-range-optimization and no-join-scan-range-optimization. Sounds good?

          jamestaylor James Taylor added a comment -

          Yes, the decision of using/not using it will be gated by the join key corresponding to the leading PK col(s). Then I think we can get by with a single no-join-scan-range-optimization hint that would disable this optimization even when we could use it.

          maryannxue Maryann Xue added a comment -

          Since we don't have stats, it is difficult to judge whether the join key can be narrowed to a small set of values (compared to the total number of row keys). So, again taking PHOENIX-889 as an example, I think it would make good sense to just guess whether the RHS (hash cache) will be narrowed down by looking at whether it has any filters. And if we don't do (2) for now, I think we might still need to keep those two hints.

          githubbot ASF GitHub Bot added a comment -

          GitHub user maryannxue opened a pull request:

          https://github.com/apache/phoenix/pull/7

          PHOENIX-852 Optimize child/parent foreign key joins

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/maryannxue/phoenix master

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/phoenix/pull/7.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #7


          commit c89d815f64bba2cd85a2b03ed200338f9a47f63e
          Author: maryannxue <maryannxue@apache.org>
          Date: 2014-08-19T03:04:00Z

          PHOENIX-852 Optimize child/parent foreign key joins


          jamestaylor James Taylor added a comment -

          What's your thinking on putting this in the 3.1/4.1 release? Too risky, or low risk, Maryann Xue? Would a multi-part partial key match cause problems, or would it just fall back to the slower mechanism (i.e. a PK of c1,c2,c3 with a join key of c1,c2)?

          maryannxue Maryann Xue added a comment - - edited

          I think it will be ok. My only concern is that when there are too many key values, each of which gets wrapped as a LiteralExpression to construct an InListExpression, there might be too much object creation overhead; or, if the key space is pretty much covered by those values, the point lookups would be unnecessary and only introduce more overhead. So I don't know if it would be safest just to disable it by default? The current algorithm, if no hint is present, is to turn on this optimization when the RHS has filters (i.e., in general, has a where clause). For example: select * from A join B on A.id = B.id where B.b = 'x'.

          jamestaylor James Taylor added a comment -

          I see - so without a filter, there's no need for using the skip scan because the matching rows on the LHS will be contiguous, right? Now I understand why you need two hints.

          Orthogonal to this change, what about the child PK -> parent FK case, like the following?

          select * from A join B on A.fk = B.id
          

          Assume that there's an index on A.fk. Do we add a A.fk IS NOT NULL filter to the LHS query? That would help if A is much bigger than B and there are lots of null values for A.fk.

          And related to this, what about the partial key match scenario when there's no where clause? For example, assume table A has PK of (a, b) and table B has PK of (a), and you have a query like this one:

          select * from A join B on A.a = B.a
          

          Does the LHS scan have a start row equal to the min value in the RHS scan? I suppose this would cover the null case too.
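          A sketch of what those two suggestions might look like (hypothetical rewrites, not taken from the patch):

              -- Child PK -> parent FK case: add a null filter so the (indexed) LHS scan
              -- can skip rows with no parent reference
              SELECT * FROM A JOIN B ON A.fk = B.id;   -- effectively adds: A.fk IS NOT NULL

              -- Partial key match with no WHERE clause: bound the LHS scan by the min/max
              -- join-key values observed on the RHS
              SELECT * FROM A JOIN B ON A.a = B.a;     -- effectively adds: A.a >= <rhs min> AND A.a <= <rhs max>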

          maryannxue Maryann Xue added a comment -

          I see - so without a filter, there's no need for using the skip scan because the matching rows on the LHS will be contiguous, right?

          Yeah, that's just an assumption based on some common cases. But I do think there are a lot of exceptions; we can't really make any good guesses without stats. So I think for now it might just be better that users enable this optimization explicitly (with a hint).
          Exceptions to my assumption can be, for example: 1) even with a filter, there might still be a lot of key values from the RHS table; 2) sometimes, even without a filter, the two tables differ in size quite significantly and this optimization should be turned on.

          With the two cases you mentioned above, currently no. But that's a good reminder. How about we do this:

          By default, we construct a BETWEEN-AND clause instead of an IN clause, so that in the "too many key values" situation this would do no harm, while in some other cases it would still help. And we only construct IN clauses when the hint is present or, in the future, when stats are available. What do you think?
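          That is, a sketch of the two generated LHS predicates for the hypothetical A/B join:

              -- Default: range scan bounded by the RHS min/max join-key values
              SELECT * FROM A JOIN B ON A.id = B.id;
              -- LHS scanned as if:  A.id >= <rhs min> AND A.id <= <rhs max>

              -- With the hint (or, later, stats): point lookups via a skip scan
              -- LHS scanned as if:  A.id IN (<rhs key 1>, ..., <rhs key n>)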

          jamestaylor James Taylor added a comment -

          +1

          maryannxue Maryann Xue added a comment -

          One question, is it safe to use BETWEEN-AND if there are nulls?

          jamestaylor James Taylor added a comment -

          Yes, nulls would never be included. I can't remember if BETWEEN is inclusive or exclusive on each side - just confirm that.

          maryannxue Maryann Xue added a comment -

          Good, thanks! There's actually no between expression. I'll just do >= min and <= max.

          jamestaylor James Taylor added a comment -

          Ah right - I forgot - I compile BETWEEN AND into >= and <= at parse time.

          jamestaylor James Taylor added a comment -

          Thanks for the quick turnaround on the patch, Maryann Xue. Instead of adding new hints, how about we either name them similarly to the ones we have, or even use those directly if possible. For example, we have these two:

                  /**
                   * Forces a range scan to be used to process the query.
                   */
                  RANGE_SCAN,
                  /**
                   * Forces a skip scan to be used to process the query.
                   */
                  SKIP_SCAN,
          

          We could just use those to drive this decision too, but then this would apply to the RHS and LHS query, so perhaps that's not so good. A quick and dirty way would be to tack something on to these, like RANGE_SCAN_HASH_JOIN and SKIP_SCAN_HASH_JOIN. Another way might be to allow these hints to be scoped to a table/alias, like SKIP_SCAN(my_table). Not sure how to communicate that it doesn't apply to the RHS table, but does apply to the join processing. Maybe just the quick-and-dirty way for consistency?

          Thoughts?

          maryannxue Maryann Xue added a comment -

          Well, I think the quick-and-dirty way will be enough for now, with RANGE_SCAN_HASH_JOIN as the default choice.

          jamestaylor James Taylor added a comment - - edited

          In the absence of any hints, I think a skip scan should be done when possible (i.e. if the leading PK columns are used, basically determined by ScanRanges.useSkipScan() after WhereOptimizer does its thing). If a skip scan is possible and the RANGE_SCAN_HASH_JOIN hint is present, then do a range scan. If SKIP_SCAN_HASH_JOIN is present, then do a skip scan if some PK keys are being joined (it doesn't make sense otherwise). This is how the SKIP_SCAN and RANGE_SCAN hints work now.

          What about the case of a partial PK match? For example, where there's a PK of (c1,c2,c3) and only c1 and c2 are matched? Does that just depend on how it's handled by WhereOptimizer, so that when we handle that case there, we'll get it for free here?

          If all unit tests pass and you feel this is isolated enough to just joins, then feel free to commit.

          maryannxue Maryann Xue added a comment -

          Since using a skip scan or a range scan in joins largely depends on the RHS table data, and since we can have stats to make a pretty good judgement later on, I do prefer having a separate set of hints and independent code logic for this use.

          So is it good enough that by default we do BETWEEN-AND for full key match (e.g. c1,c2,c3 matched in c1,c2,c3), but only IN clause if the SKIP_SCAN_HASH_JOIN hint is on; and by default do IN clause for partial key match (e.g. c1,c2 matched in c1,c2,c3), but only BETWEEN-AND clause if RANGE_SCAN_HASH_JOIN hint is on?

          maryannxue Maryann Xue added a comment -

          Suppose the child table has PK (c1,c2,c3), the parent has PK (c1,c2), the (c1,c2) values from the child table fully occupy the key space of (c1,c2) in the parent table, and c3 has about 10-20 different values for each (c1,c2). How much impact would there be on performance if we do a skip scan in this case?

          jamestaylor James Taylor added a comment -

          If it's ok with you, let's hold off on committing this change until we figure this out completely. We can always get it in a patch release shortly afterwards.

          The perf difference between a scan with fully qualified keys using an IN clause versus a range scan is big. For the scenario you mentioned the difference is smaller - it likely depends on how much data (%-wise) is filtered out. mujtaba would be the best one to answer that. We should graph that out, though.

          So is it good enough that by default we do BETWEEN-AND for full key match (e.g. c1,c2,c3 matched in c1,c2,c3), but only IN clause if the SKIP_SCAN_HASH_JOIN hint is on.

          What about the case where the RHS has been filtered down a lot and you have a fully qualified key? Then a full scan over the LHS will be much worse than a skip scan driven by the keys formed through the RHS rows. I think this may be the most common case.

          maryannxue Maryann Xue added a comment -

          What about the case where the RHS has been filtered down a lot and you have a fully qualified key? Then a full scan over the LHS will be much worse than a skip scan driven by the keys formed through the RHS rows. I think this may be the most common case.

          I don't think there's a silver bullet for this problem before we have stats, and I assume the goal right now is to be relatively conservative at this stage. So why don't we just check in now and go with "by default we do BETWEEN-AND for full key match (e.g. c1,c2,c3 matched in c1,c2,c3), but only IN clause if the SKIP_SCAN_HASH_JOIN hint is on"? At least people can start using this feature to optimize their queries, and in some cases they'll just have to be aware of the hints to do even better.

          jamestaylor James Taylor added a comment -

          I think we should do some perf testing to figure it out first. I do think the common case is the one I mentioned, as it's a classic case for a hash join: two big tables A and B. You have a query that joins them, but filters B dramatically. Then you join back to A to get some info through a parent or child foreign key. The perf difference between a full scan over A and a skip scan over the full PK when you're only matching, say, 1-5% of the table will be dramatic. I don't think a skip scan even over 50% of the table will be much slower than a full scan over it.

          Let's get 3.1/4.1 out - the RC is being built now. Then perhaps mujtaba can do some perf testing. I'd like to start doing monthly point releases.

          maryannxue Maryann Xue added a comment -

          Yeah, I'm ok with checking in later, after the release. But IMHO, an optimization should not penalize the worse cases, even if they're not that common.

          jamestaylor James Taylor added a comment -

          There will always be cases where an optimization choice turns out to be wrong, even once we have stats. We just have to choose what we think is the best choice and (for now) fall back on hints to override it.

          I think for the default case, we can have a compromise solution. If the RHS table has a filter and a skip scan can be done on the LHS, then we default to the skip scan. If the RHS has no filter, then we default to a range scan.

          maryannxue Maryann Xue added a comment -

          Agreed

          maryannxue Maryann Xue added a comment -

          Added 2 hints: RANGE_SCAN_HASH_JOIN, SKIP_SCAN_HASH_JOIN

          1) When RANGE_SCAN_HASH_JOIN is specified, we do BETWEEN-AND clause for RHS values.

          2) When no hint is available and RHS does not have filters, we do BETWEEN-AND clause for RHS values.

          3) When no hint is available and RHS has filters, we do IN clause for RHS values, and leave the decision of skip-scan vs. range-scan to WhereOptimizer.setScanFilter() and ScanRanges.useSkipScanFilter().

          4) When SKIP_SCAN_HASH_JOIN is available, we do IN clause for RHS values, and force use of skip-scan in WhereOptimizer.setScanFilter().
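          Putting the four rules together (a sketch, reusing the hypothetical A/B tables from earlier comments):

              -- (1) Hinted range scan: min/max bound on the LHS join key
              SELECT /*+ RANGE_SCAN_HASH_JOIN */ * FROM A JOIN B ON A.id = B.id WHERE B.b = 'x';

              -- (2) No hint, no RHS filter: defaults to the min/max bound
              SELECT * FROM A JOIN B ON A.id = B.id;

              -- (3) No hint, RHS filtered: IN clause built from the RHS keys;
              --     WhereOptimizer/ScanRanges then pick skip scan vs. range scan
              SELECT * FROM A JOIN B ON A.id = B.id WHERE B.b = 'x';

              -- (4) Hinted skip scan: IN clause and skip scan forced
              SELECT /*+ SKIP_SCAN_HASH_JOIN */ * FROM A JOIN B ON A.id = B.id;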

          jamestaylor James Taylor added a comment -

          +1. Looks great, Maryann Xue. Have you done any local perf testing on this? It'll probably be pretty dramatic when this kicks in.

          jamestaylor James Taylor added a comment -

          One other question. In HashCacheClient, can you remind me what this does (maybe add a comment?):

          -    private void serialize(ImmutableBytesWritable ptr, ResultIterator iterator, TupleProjector projector, long estimatedSize, List<Expression> onExpressions) throws SQLException {
          +    private void serialize(ImmutableBytesWritable ptr, ResultIterator iterator, TupleProjector projector, long estimatedSize, List<Expression> onExpressions, Expression keyRangeRhsExpression, List<ImmutableBytesWritable> keyRangeRhsValues) throws SQLException {
                   long maxSize = serverCache.getConnection().getQueryServices().getProps().getLong(QueryServices.MAX_SERVER_CACHE_SIZE_ATTRIB, QueryServicesOptions.DEFAULT_MAX_SERVER_CACHE_SIZE);
                   estimatedSize = Math.min(estimatedSize, maxSize);
                   if (estimatedSize > Integer.MAX_VALUE) {
          @@ -105,6 +105,13 @@ public class HashCacheClient  {
                           if (baOut.size() > maxSize) {
                               throw new MaxServerCacheSizeExceededException("Size of hash cache (" + baOut.size() + " bytes) exceeds the maximum allowed size (" + maxSize + " bytes)");
                           }
          +                if (keyRangeRhsExpression != null) {
          +                    ImmutableBytesWritable value = new ImmutableBytesWritable();
          +                    keyRangeRhsExpression.reset();
          +                    if (keyRangeRhsExpression.evaluate(result, value)) {
          +                        keyRangeRhsValues.add(value);
          +                    }
          +                }
                           nRows++;
                       }
          
          jamestaylor James Taylor added a comment -

          And instead of the code below, it'd be better if you called into WhereOptimizer to get this information. There are cases beyond a simple column reference that we'd be able to optimize (like a SUBSTR(rowKeyCol, 1, 3), or even a row value constructor construct):

          +    private Pair<Expression, Expression> extractKeyRangeExpressions(StatementContext context, PTable table, JoinType type, final List<Expression> joinExpressions, final List<Expression> hashExpressions) {
          +        if (type != JoinType.Inner)
          +            return new Pair<Expression, Expression>(null, null);
          +        
          +        List<Integer> rowkeyColumnIndexes = Lists.newArrayList();
          +        for (int i = 0; i < joinExpressions.size(); i++) {
          +            Expression joinExpression = joinExpressions.get(i);
          +            if (joinExpression instanceof RowKeyColumnExpression) {
          +                rowkeyColumnIndexes.add(i);
          +            }
          +        }
          +        Collections.sort(rowkeyColumnIndexes, new Comparator<Integer>() {
          +            @Override
          +            public int compare(Integer l, Integer r) {
          +                return ((RowKeyColumnExpression) joinExpressions.get(l)).getPosition() - ((RowKeyColumnExpression) joinExpressions.get(r)).getPosition();
          +            }
          +        });
          +        int positionOffset = (table.getBucketNum() ==null ? 0 : 1) + (context.getConnection().getTenantId() != null && table.isMultiTenant() ? 1 : 0) + (table.getViewIndexId() == null ? 0 : 1);
          +        int position = 0;
          +        for (Integer index : rowkeyColumnIndexes) {
          +            RowKeyColumnExpression exp = (RowKeyColumnExpression) joinExpressions.get(index);
          +            if (exp.getPosition() != position + positionOffset) {
          +                break;
          +            }
          +            position++;
          +        }
          +        
          +        if (position == 0)
          +            return new Pair<Expression, Expression>(null, null);
          +        
          +        if (position == 1)
          +            return new Pair<Expression, Expression>(joinExpressions.get(rowkeyColumnIndexes.get(0)), hashExpressions.get(rowkeyColumnIndexes.get(0)));
          +        
          +        List<Expression> lChildren = Lists.newArrayList();
          +        List<Expression> rChildren = Lists.newArrayList();
          +        for (int i = 0; i < position; i++) {
          +            Integer index = rowkeyColumnIndexes.get(i);
          +            lChildren.add(joinExpressions.get(index));
          +            rChildren.add(hashExpressions.get(index));
          +        }
          +        
          +        return new Pair<Expression, Expression>(new RowValueConstructorExpression(lChildren, false), new RowValueConstructorExpression(rChildren, false));
          +    }
          +    
          
          maryannxue Maryann Xue added a comment -

          One other question. In HashCacheClient, can you remind me what this does (maybe add a comment?):

          Ok, will do. This is to evaluate the value set for InExpression in the same iteration where we do hash-table building.

          And instead of the code below, it'd be better if you called into WhereOptimizer to get this information. There are cases beyond a simple column reference that we'd be able to optimize (like a SUBSTR(rowKeyCol, 1, 3), or even a row value constructor construct):

          The reason I do that now is to sort the join keys so they can be organized into a RowValueConstructor with as many leading key parts as possible. For example, if c1 is the first join key and c0 is the second, we'll still be able to get a (c0, c1) for the final dynamic filter. Yeah, I also understand there are situations like the ones you mentioned above. So is there any way we can cover both situations best?
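
          To illustrate with a made-up example (hypothetical table and column names, not from the patch): suppose the parent table's primary key is (c0, c1) and the join condition happens to list c1 first.

          select count(*) from parent p join child c on p.c1 = c.fk1 and p.c0 = c.fk0;
          -- After sorting the join keys by PK position, the dynamic filter pushed to
          -- the parent-side scan at runtime can still use the full leading prefix:
          --   (p.c0, p.c1) IN (<key combinations collected from the child/hash side>)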

          maryannxue Maryann Xue added a comment -

          I had not read the new WhereOptimizer code when I wrote that part.
          One way is to let WhereOptimizer.KeyExpressionVisitor visit each join key individually and order those join keys by their key part's starting position; or should we do something similar to "andKeySlots()"?

          In "newRowValueConstructorKeyParts()", there is this code: Does it mean if a nested row value constructor cannot be handled here? So we have to flatten them after constructing a new row value constructor for qualified join keys, right?

                          if (childExtractNodes.size() != 1 || childExtractNodes.get(0) != rvc.getChildren().get(i)) {
                              break;
                          }
          

          And for a more complicated case: say we have join keys SUBSTR(rowKeyCol0, 1, 3) and rowKeyCol1, would it make sense to construct a RowValueConstructor(SUBSTR(rowKeyCol0, 1, 3), rowKeyCol1)? Would WhereOptimizer be able to handle that?

          jamestaylor James Taylor added a comment -

          My top-level point is to make sure we have a common API in WhereOptimizer, as otherwise it'll be difficult to maintain.

          Just refactor WhereOptimizer.pushKeyExpressionsToScan() to return the info you need. The main thing it produces is a ScanRanges which will give you everything you need. If ScanRanges.useSkipScan() is false, then you won't/can't do the optimization.

          My first thought is to have something like this:

              public static ScanRanges pushKeyExpressionsToScan(StatementContext context, FilterableStatement statement,
                      Expression whereClause, Set<Expression> extractNodes) {
          

          This function wouldn't call context.setScanRanges(), but would just return the ScanRanges instead. And it wouldn't do this bit at the end:

                  if (whereClause == null) {
                      return null;
                  } else {
                      return whereClause.accept(new RemoveExtractedNodesVisitor(extractNodes));
                  }
          

          Instead you'd maintain a different function with the original signature that would do this part:

                  context.setScanRanges(scanRanges);
                  if (whereClause == null) {
                      return null;
                  } else {
                      return whereClause.accept(new RemoveExtractedNodesVisitor(extractNodes));
                  }
          

          and change ScanRanges so that the minMaxRange is passed in through its constructor; context.setScanRanges() can then access it from the ScanRanges.

          maryannxue Maryann Xue added a comment -

          With more than one qualified join key, like pkCol0 and pkCol1, what I'm trying to do is not "where pkCol0 in ... and pkCol1 in ..." but instead "where (pkCol0, pkCol1) in ...". These two are different, and with the first where clause there would be a lot of meaningless point lookups. But I don't think RowValueConstructor can handle unordered key parts, like (pkCol1, pkCol0). Am I right?
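
          To make the difference concrete with made-up values (hypothetical, just to show the two filter shapes):

          -- row value constructor form: only the key combinations that actually came
          -- from the hash-cache side are looked up
          where (pkCol0, pkCol1) in (('a', 1), ('b', 2))

          -- per-column form: the cross product ('a',1), ('a',2), ('b',1), ('b',2)
          -- gets looked up, and here half of those point lookups are meaningless
          where pkCol0 in ('a', 'b') and pkCol1 in (1, 2)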

          jamestaylor James Taylor added a comment -

          Good point. You'll need to sort the join keys by PK position, but not do anything else. The WhereOptimizer will do the rest for you.

          maryannxue Maryann Xue added a comment -

          Yes, I will sort the qualified join keys in PK position order and create a new RowValueConstructor to include them (if more than one join key qualifies).
          Since we don't have actual ranges until runtime (when the right-hand-side operand is evaluated), what I'm trying to do at compile time (with extractKeyRangeExpressions()) is to make sure there are meaningful join keys with which we can optimize at runtime, and if so, to pick out those join keys and organize them in a way that WhereOptimizer can use to generate ranges at runtime. With only one join key the problem is simple and straightforward, but with multiple join keys we have to carefully select those that can help us most.
          There are three steps here:
          1) to pick out the potentially helpful join key expressions, for which I assume WhereOptimizer.KeyExpressionVisitor would return non-null KeySlots.
          2) to make a combination of the join key expressions selected in 1) that can hopefully generate the most meaningful key ranges.
          3) to verify that the final expression we have constructed will be useful to WhereOptimizer at runtime.

          Step 2 is trickier, and it really depends on what cases we can currently handle with RowValueConstructor. What I'm thinking is:
          a. make use of the KeySlots returned by WhereOptimizer.KeyExpressionVisitor to get pk position and span
          b. order them by pk position
          c. try constructing a RowValueConstructor with the first N keys and find the largest N for which KeyExpressionVisitor would return a KeySlots.
          d. the final expression we'll potentially use at runtime is "RowValueConstructor(sortedJoinKey1, sortedJoinKey2,... sortedJoinKeyN)"
          For example, we have join keys pkCol0, pkCol1, pkCol3, and the largest N will be 2, and the final expression will be (pkCol0, pkCol1).
          With better knowledge of which RowValueConstructor cases WhereOptimizer can handle, I can have a more sophisticated algorithm, like taking out some duplicated key slots.

          Step 3 is like part of the logic from pushKeyExpressionsToScan(), but a little different. Like I said, we don't have actual ranges at compile time, so we'll have to extract the "verification" part of that function.
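
          Restating the pkCol0/pkCol1/pkCol3 example above in query form (hypothetical names, assuming the full PK is (pkCol0, pkCol1, pkCol2, pkCol3)):

          -- only the contiguous leading prefix can drive the skip scan at runtime:
          where (pkCol0, pkCol1) in (<combinations collected from the hash-cache side>)
          -- pkCol3 is not part of the leading prefix, so it is still matched by the
          -- hash join itself rather than by the scan filter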

          maryannxue Maryann Xue added a comment -

          Did a local performance test:

          Group1:
          CREATE TABLE T_1000 (
          mypk VARCHAR(10) NOT NULL PRIMARY KEY,
          CF.column1 varchar(10),
          CF.column2 varchar(10),
          CF.column3 varchar(10));
          CREATE TABLE T_5000000 (
          mypk VARCHAR(10) NOT NULL PRIMARY KEY,
          CF.column1 varchar(10),
          CF.column2 varchar(10),
          CF.column3 varchar(10));
          T_1000:
          mypk: string value of 0 ~ 999
          column3: string value of random int between 0 ~ 9999
          T_5000000:
          mypk: string value of 0 ~ 4999999

          Query | Before | After | Scan Type
          select count(*) from T_5000000 t1 join T_1000 t2 on t1.mypk = t2.column3; | 8.1 | 8.2 | RANGE
          select count(*) from T_5000000 t1 join T_1000 t2 on t1.mypk = t2.column3 and t2.column3 != '9859'; | 8.1 | 0.40 | SKIP
          select count(*) from T_5000000 t1 join T_1000 t2 on t1.mypk = t2.column3 and t2.column3 = '9859'; | 7.9 | 0.23 | SKIP

          Group2:
          CREATE TABLE T_1000_INT (
          mypk INTEGER NOT NULL PRIMARY KEY,
          CF.column1 INTEGER,
          CF.column2 INTEGER,
          CF.column3 INTEGER);
          CREATE TABLE T_5000000_INT (
          mypk INTEGER NOT NULL PRIMARY KEY,
          CF.column1 INTEGER,
          CF.column2 INTEGER,
          CF.column3 INTEGER);
          T_1000_INT:
          mypk: 0 ~ 999
          column3: random int between 0 ~ 9999
          T_5000000_INT:
          mypk: 0 ~ 4999999

          Query | Before | After | Scan Type
          select count(*) from T_5000000_INT t1 join T_1000_INT t2 on t1.mypk = t2.column3; | 8.1 | 0.28 | RANGE
          select count(*) from T_5000000_INT t1 join T_1000_INT t2 on t1.mypk = t2.column3 and t2.column3 != 6768; | 8.1 | 0.41 | SKIP
          select count(*) from T_5000000_INT t1 join T_1000_INT t2 on t1.mypk = t2.column3 and t2.column3 = 6768; | 7.9 | 0.23 | SKIP

          The reason I did two groups here is that, as you can see, the first query in group 1 did not benefit from the optimization: this query goes through range scan optimization by default (due to no filters), and the range turns out to be close to the full range in string order (the smallest-to-largest span of the column3 string values covers nearly all of T_5000000's keys). But with group 2, where we have a natural integer order, the range is narrowed down to just ~10000 rows.

          jamestaylor James Taylor added a comment -

          That's a nice improvement, Maryann Xue!

          Regarding the constants to use (since you don't have these yet), I don't think it should matter which ones you use as long as they're not null. You might need to supply two values (to prevent it being turned into an equality expression), but if you invoke the WhereOptimizer directly, I don't think you'll even need to do this.

          FYI, Kyle Buzsaki is working on moving the minMaxRange into ScanRanges (which I believe may help). I'm hoping you can get what you need from the ScanRanges produced by calling WhereOptimizer.pushKeyExpressionsToScan(). If ScanRanges.getRanges() is populated with the constant expression you supplied, you know it is optimized. You'll likely need to add an accessor for the slotSpan member variable too - that'll tell you how many columns were optimized.

          jamestaylor James Taylor added a comment -

          For ScanRanges, I'd create a new method like the following:

          public int getPkColumnSpan() {
              return ScanUtil.calculateSlotSpan(ranges, slotSpan);
          }
          

          This will return the number of PK columns that have been optimized by the WhereOptimizer.

          For the constants, you just need to use any valid value given the PDataType. If you want, feel free to add a PDataType.getSampleValue() method that'll return that. You'd need to create a where clause with an expression like (joinKeyCol1, joinKeyCol2, ...) IN ((sampleVal1, sampleVal2, ...),(sampleVal1, sampleVal2)). I think you'll need to use two values in the IN clause or it'll compile it to an = expression instead.
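
          A sketch of what that compile-time dummy where clause could look like, assuming two join keys of types VARCHAR and INTEGER (the literals here are placeholders, not what getSampleValue() would actually produce):

          -- built only so the WhereOptimizer can be run at compile time; the real
          -- value list is substituted at runtime from the hash-cache side
          where (joinKeyCol1, joinKeyCol2) in (('a', 1), ('b', 2))
          -- two rows in the IN list keep it from being collapsed into an = expression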

          maryannxue Maryann Xue added a comment -

          Thanks a lot for the information, James Taylor! Think that will work well.

          maryannxue Maryann Xue added a comment -

          Moved logic of getting key expression combination to WhereOptimizer.

          jamestaylor James Taylor added a comment -

          Looks good, Maryann Xue. See my comment on PHOENIX-1220 - it'd be good to adjust your getSampleValue call to take that into account prior to checking this in. +1 after that.

          maryannxue Maryann Xue added a comment -

          I think getSampleValue() itself has no problem; it also provides two other signatures, "getSampleValue(maxLength, arrayLength)" and "getSampleValue(maxLength)", in which "maxLength" is only useful for VARCHAR, CHAR, VARBINARY, BINARY and their corresponding array types, and if not specified we just use a default value of 1. I just need to adjust my PDataTypeTest.testGetSampleValue() to use the "toObject()" method with the "maxLength" parameter.

          jamestaylor James Taylor added a comment -

          I see. Note that for VARCHAR and VARBINARY, the maxLength is the max size of the value, so I doubt you'd need to use it for those types. Anyway, this all looks very good, so please commit it when you're ready.

          maryannxue Maryann Xue added a comment -

          Updated PDataTypeTest.testGetSampleValue()

          hudson Hudson added a comment -

          SUCCESS: Integrated in Phoenix | Master | Hadoop1 #353 (See https://builds.apache.org/job/Phoenix-master-hadoop1/353/)
          PHOENIX-852 Optimize child/parent foreign key joins (maryannxue: rev 27a6ccef31b3fa7fd4ca4a4e9cb9bc01ae498238)

          • phoenix-core/src/test/java/org/apache/phoenix/schema/PDataTypeTest.java
          • phoenix-core/src/main/java/org/apache/phoenix/parse/HintNode.java
          • phoenix-core/src/main/java/org/apache/phoenix/schema/PArrayDataType.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/ScanRanges.java
          • phoenix-core/src/main/java/org/apache/phoenix/schema/PDataType.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/JoinCompiler.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/QueryCompiler.java
          • phoenix-core/src/main/java/org/apache/phoenix/execute/HashJoinPlan.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/WhereOptimizer.java
          • phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java
          • phoenix-core/src/it/java/org/apache/phoenix/end2end/HashJoinIT.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/WhereCompiler.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Phoenix | 3.0 | Hadoop1 #203 (See https://builds.apache.org/job/Phoenix-3.0-hadoop1/203/)
          PHOENIX-852 Optimize child/parent foreign key joins (maryannxue: rev be6465a89578fbb96c3324bdf88b44218acf5e80)

          • phoenix-core/src/main/java/org/apache/phoenix/compile/WhereOptimizer.java
          • phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/JoinCompiler.java
          • phoenix-core/src/main/java/org/apache/phoenix/schema/PDataType.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/ScanRanges.java
          • phoenix-core/src/test/java/org/apache/phoenix/schema/PDataTypeTest.java
          • phoenix-core/src/main/java/org/apache/phoenix/schema/PArrayDataType.java
          • phoenix-core/src/main/java/org/apache/phoenix/parse/HintNode.java
          • phoenix-core/src/it/java/org/apache/phoenix/end2end/HashJoinIT.java
          • phoenix-core/src/main/java/org/apache/phoenix/execute/HashJoinPlan.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/WhereCompiler.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/QueryCompiler.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Phoenix | 4.0 | Hadoop2 #88 (See https://builds.apache.org/job/Phoenix-4.0-hadoop2/88/)
          PHOENIX-852 Optimize child/parent foreign key joins (maryannxue: rev c39a05e289f915725d2b439e819ccd3a07dec56b)

          • phoenix-core/src/test/java/org/apache/phoenix/schema/PDataTypeTest.java
          • phoenix-core/src/main/java/org/apache/phoenix/schema/PArrayDataType.java
          • phoenix-core/src/it/java/org/apache/phoenix/end2end/HashJoinIT.java
          • phoenix-core/src/main/java/org/apache/phoenix/parse/HintNode.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/WhereOptimizer.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/QueryCompiler.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/JoinCompiler.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/WhereCompiler.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/ScanRanges.java
          • phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java
          • phoenix-core/src/main/java/org/apache/phoenix/execute/HashJoinPlan.java
          • phoenix-core/src/main/java/org/apache/phoenix/schema/PDataType.java
          jamestaylor James Taylor added a comment -

          Excellent work, Maryann Xue

          enis Enis Soztutar added a comment -

          Bulk close of all issues that have been resolved in a released version.

          githubbot ASF GitHub Bot added a comment -

          Github user maryannxue closed the pull request at:

          https://github.com/apache/phoenix/pull/7


  People

  • Assignee: maryannxue Maryann Xue
  • Reporter: jamestaylor James Taylor
  • Votes: 0
  • Watchers: 8
