PHOENIX-258

Use skip scan when SELECT DISTINCT on leading row key column(s)

    Details

    • Type: Task
    • Status: Closed
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8.0
    • Labels: None
    • Old issue number: 127

      Description

      create table foo(a varchar(32) not null, date date not null constraint pk primary key(a,date))

      [["PLAN"],["CLIENT PARALLEL 94-WAY FULL SCAN OVER foo"],[" SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY [a]"],["CLIENT MERGE SORT"]]

      We should use a skip scan instead.

      1. 258.txt
        5 kB
        Lars Hofhansl
      2. 258-v1.txt
        6 kB
        Lars Hofhansl
      3. 258-v10.txt
        29 kB
        Lars Hofhansl
      4. 258-v11.txt
        32 kB
        Lars Hofhansl
      5. 258-v12.txt
        38 kB
        Lars Hofhansl
      6. 258-v13.txt
        41 kB
        Lars Hofhansl
      7. 258-v14.txt
        45 kB
        Lars Hofhansl
      8. 258-v15.txt
        44 kB
        Lars Hofhansl
      9. 258-v16.txt
        45 kB
        Lars Hofhansl
      10. 258-v17.txt
        44 kB
        Lars Hofhansl
      11. 258-v2.txt
        8 kB
        Lars Hofhansl
      12. 258-v3.txt
        9 kB
        Lars Hofhansl
      13. 258-v4.txt
        17 kB
        Lars Hofhansl
      14. 258-v5.txt
        28 kB
        Lars Hofhansl
      15. 258-v6.txt
        27 kB
        Lars Hofhansl
      16. 258-v7.txt
        26 kB
        Lars Hofhansl
      17. 258-v8.txt
        28 kB
        Lars Hofhansl
      18. 258-v9.txt
        28 kB
        Lars Hofhansl
      19. 258-WIP.txt
        5 kB
        Lars Hofhansl
      20. DistinctFixedPrefixFilter.java
        2 kB
        Lars Hofhansl
      21. in-clause.png
        7 kB
        Nick Dimiduk

          Activity

          pctony Tony Stevenson added a comment -

          Comment:jamestaylor:04/05/13 09:04:14 PM:

          This query will be executed by aggregating in-place as the scan is being run (as opposed to holding the distinct groups in memory for each region). You need to look at all rows. How are you thinking a skip scan would help?

          pctony Tony Stevenson added a comment -

          Comment:ryang-sfdc:04/05/13 10:00:52 PM:

          I think you don't need to look at all rows. Once you've seen (a='abc',date='2013-04-01') you can hint hbase to skip ahead to (a='abd',date='1970-01-01'), given that we are grouping by a.
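          (A minimal sketch of the HBase mechanism being proposed here, assuming the HBase 1.x filter API; the class and field names are illustrative, not Phoenix code. A custom Filter returns SEEK_NEXT_USING_HINT and supplies the key to jump to via getNextCellHint():)

              import org.apache.hadoop.hbase.Cell;
              import org.apache.hadoop.hbase.KeyValueUtil;
              import org.apache.hadoop.hbase.filter.FilterBase;

              public class SeekAheadSketch extends FilterBase {
                  private final byte[] nextRowHint; // hypothetical precomputed row key, e.g. ('abd', '1970-01-01')

                  public SeekAheadSketch(byte[] nextRowHint) {
                      this.nextRowHint = nextRowHint;
                  }

                  @Override
                  public ReturnCode filterKeyValue(Cell cell) {
                      // instead of stepping row by row, ask the scanner to seek far ahead
                      return ReturnCode.SEEK_NEXT_USING_HINT;
                  }

                  @Override
                  public Cell getNextCellHint(Cell currentCell) {
                      // the scanner jumps directly to the first cell of the hinted row
                      return KeyValueUtil.createFirstOnRow(nextRowHint);
                  }
              }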

          pctony Tony Stevenson added a comment -

          Comment:jamestaylor:04/05/13 10:11:20 PM:

          Good idea. So from 'abc\0', you could jump to 'abc\1' which would put you past all the column values of 'abc'. If the cardinality of the first column is low, that's likely more efficient. We need stats to determine when this is more efficient than just doing a full table scan.

          pctony Tony Stevenson added a comment -

          Comment:jamestaylor:06/17/13 09:52:53 PM:

          This optimization can only be done if you're not using any aggregation functions. If you're using aggregation functions, you need to visit every row to do the aggregation. It's somewhat more applicable to SELECT DISTINCT (which turns into exactly this - an aggregate query with no aggregate functions). It's also a slight tweak to the SkipScanFilter too, since we'd want the first value to pass the filter and then we'd want to skip duplicates to get to the next value. I don't think it'd be too hard to adapt the skip scan and it'll make a big performance difference if the cardinality is low.

          Make sense, @ryang-sfdc?

          pctony Tony Stevenson added a comment -

          Comment:ryang-sfdc:06/17/13 09:52:53 PM:

          mentioned

          pctony Tony Stevenson added a comment -

          Comment:jamestaylor:06/18/13 12:37:03 AM:

          Thinking out loud here a bit, but another complication even for the SELECT DISTINCT case is if the WHERE clause doesn't match a given row. This would be wrong because you'd be skipping over a bunch of rows that might match the WHERE clause filter. I think we can potentially work around this by putting the SkipScanFilter after the filter for the WHERE clause. Then the skip scan will only see rows that have passed and can safely jump ahead to the next distinct row.

          jamestaylor James Taylor added a comment -

          See PHOENIX-2797 for a good description and rationale.

          jamestaylor James Taylor added a comment -

          Lars Hofhansl - you ok if we assign this to you (as you've been dabbling with it recently)?

          FYI, to get the perf gain of the skip scan, I believe you'll need to pass a new boolean to the SkipScanFilter that indicates it's being used for DISTINCT. Otherwise, rather than skipping all duplicate rows, it's going to include them. So a tweak like this in SkipScanFilter.navigate():

                  // First check to see if we're in-range until we reach our end key
                  if (endKeyLength > 0) {
                      if (!this.isDistinct && Bytes.compareTo(currentKey, offset, length, endKey, 0, endKeyLength) < 0) {
                          return getIncludeReturnCode();
                      }
                      // ...
                  }

          Also, if there's a filter from the WHERE clause (i.e. any remaining filtering that was left over after computing the start/stop row of the scan), I suspect you won't be able to perform this optimization. In that case, you still need to traverse the rows first, before aggregating as you wouldn't know which of the duplicate rows match or don't match without looking at them all.

          lhofhansl Lars Hofhansl added a comment - edited

          If this means nobody else will be looking at it, it's probably not a good idea. I have so little engineering time these days.

          Have been playing around with the SkipScanFilter, but that's not working. We need a filter, perhaps similar to it, that can move on after it has seen the requested distinct prefix of the key once; the SkipScanFilter is a "match" filter - while a key matches a range it will include it.

          Edit: And that's exactly what you say above as well.

          jamestaylor James Taylor added a comment -

          I don't think we need another filter for this optimization - we just need a very small tweak to the SkipScanFilter to be used in this new context (see above).

          lhofhansl Lars Hofhansl added a comment -

          It's a little more complex than that. We need to increment the key-prefix (and return a seek hint) after we've seen it once, and only if we see the same key-prefix again.

          I'll park the test patch here - NOT WORKING, yet.

          lhofhansl Lars Hofhansl added a comment -

          WIP-Does not work... Just parking it.

          jamestaylor James Taylor added a comment - edited

          Lars Hofhansl - had any spare cycles to work on this and if so want to attach your latest WIP patch? There's a GSoC student, Haoran Zhang, who has this on his list of JIRAs, but it's perhaps #3 on his list so he likely won't get to it right away.

          lhofhansl Lars Hofhansl added a comment -

          Some more experiments. Added a simple Filter to HBase that only lets distinct prefixes pass.

          Then I loaded some data with a 15 char prefix. I started with 16 distinct prefixes and then added more and more rows for each.

          With just about 1k values per prefix the filter is already faster: 40ms vs 23ms.
          With 16k values, it's 380ms vs 23ms.
          With 64k values it's 1.4s vs 23ms.
          With 512k values it's 6.5s vs 26ms.
          And so on. The Skip-Scan is more or less constant since the number of seeks is what counts.
          Need to test more of course with more prefixes, etc.

          The filter is a hack of course - prefix length hardcoded, fixed length only, etc. Now the task is to integrate this with the Phoenix schema.

          But definitely worth doing.
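          (For reference, a hedged reconstruction of what such a hardcoded-prefix filter could look like - the attached DistinctFixedPrefixFilter.java is the actual experiment. This sketch assumes the HBase 1.x filter API, fixed-length row key prefixes, and one cell per row, e.g. in combination with a FirstKeyOnlyFilter:)

              import java.io.IOException;
              import java.util.Arrays;

              import org.apache.hadoop.hbase.Cell;
              import org.apache.hadoop.hbase.KeyValueUtil;
              import org.apache.hadoop.hbase.filter.FilterBase;
              import org.apache.hadoop.hbase.util.Bytes;

              public class DistinctFixedPrefixSketch extends FilterBase {
                  private final int prefixLength; // hardcoded fixed prefix length, as in the experiment
                  private byte[] lastPrefix;      // prefix of the last row we let through

                  public DistinctFixedPrefixSketch(int prefixLength) {
                      this.prefixLength = prefixLength;
                  }

                  @Override
                  public ReturnCode filterKeyValue(Cell cell) throws IOException {
                      byte[] prefix = Arrays.copyOfRange(cell.getRowArray(), cell.getRowOffset(),
                              cell.getRowOffset() + prefixLength);
                      if (lastPrefix == null || !Bytes.equals(prefix, lastPrefix)) {
                          lastPrefix = prefix;          // first row of a new prefix: emit it
                          return ReturnCode.INCLUDE;
                      }
                      return ReturnCode.SEEK_NEXT_USING_HINT; // duplicate prefix: skip ahead
                  }

                  @Override
                  public Cell getNextCellHint(Cell cell) throws IOException {
                      byte[] next = Arrays.copyOf(lastPrefix, lastPrefix.length);
                      for (int i = next.length - 1; i >= 0; i--) {
                          if (++next[i] != 0) break;    // increment with carry, e.g. 'abc' -> 'abd'
                          // (an all-0xFF prefix cannot be incremented; a real patch ends the scan then)
                      }
                      return KeyValueUtil.createFirstOnRow(next);
                  }
              }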

          lhofhansl Lars Hofhansl added a comment -

          I'll attempt a combined patch. Should be just using a RowSchema iterator and calling next once with position=0 and extraSpan of the length of the distinct prefix of the row key. That would match the current key, which I can then increment in the filter.

          lhofhansl Lars Hofhansl added a comment -

          Nick Dimiduk, FYI. You mentioned you had some stats. Wanna include them here?

          ndimiduk Nick Dimiduk added a comment -

          Yes indeed. Here's a rudimentary test result – my leading column has a total cardinality of 7 values. These whisker plots (see the attached in-clause.png) show the query times when I do or do not provide those 7 values via an 'in' clause. The query itself is an audit kind of query, doing a group-by/count of events bucketed per hour. Providing the values and thus informing the skip scanner makes for a noticeable improvement in overall query time. I ran the queries 10 times each, alternating between one and the other.

          ndimiduk Nick Dimiduk added a comment -

          In my particular case, itemizing the leading column also allows Phoenix to take advantage of more of the rowkey as an index, so without the 'in' clause, my explain is

          CLIENT 51739-CHUNK PARALLEL 51739-WAY RANGE SCAN
          

          while including the 'in' clause

          CLIENT 995-CHUNK PARALLEL 995-WAY SKIP SCAN ON 140 RANGES
          
          lhofhansl Lars Hofhansl added a comment -

          Work in progress patch. Not sure it even compiles. Just parking it here.

          lhofhansl Lars Hofhansl added a comment -

          Here's a patch that works. Tested with:

          • fixed length single field prefix
          • fixed length multiple field prefix
          • variable length single field prefix
          • variable length multiple field prefix

          Everything seems to work.

          Todo:

          • explain plan
          • unit tests

          Please have a look.

          lhofhansl Lars Hofhansl added a comment -

          -v2 adds EXPLAIN

          lhofhansl Lars Hofhansl added a comment -

          -v3 observes that the same optimization is possible for queries like
          SELECT f1, ... FROM table GROUP BY g1, g2, g3, ... HAVING ...

          As long as f1, f2, ... are literal or column expressions only, and g1, g2, g3 are ordered in key order.

          I'll now add tests.

          lhofhansl Lars Hofhansl added a comment -
          • g1, g2, g3 do not even have to be ordered in key order
          • HAVING is fine as long as there are no aggregates
          • this can be combined with WHERE clauses, and everything works correctly
          lhofhansl Lars Hofhansl added a comment -
          • -v4 has some basic tests.
          • It also deals with the scenario of a fixed length key prefix that cannot be incremented any further, in which case we're done with the scan (see the sketch at the end of this comment)
          • Doesn't use this optimization unless the number of key parts used in the group by or distinct is less than the number of keys in the table (if it was the same we'd necessarily return all rows anyway)

          Tomorrow I will add some query tests, and then this should be good to go.

          We could even turn this on always (not just when the SKIP_SCAN hint is given).
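          (Referenced in the second bullet above: a hedged sketch of the increment-with-carry step. Phoenix's real helper is ByteUtil.nextKey(); this in-place variant is only for illustration:)

              // Returns false when the prefix is all 0xFF bytes and has no next
              // key, which for a fixed length prefix means the scan is done.
              static boolean incrementInPlace(byte[] prefix) {
                  for (int i = prefix.length - 1; i >= 0; i--) {
                      if (++prefix[i] != 0) {
                          return true; // no wrap-around, e.g. 0x01 0xFF -> 0x02 0x00
                      }
                      // this byte wrapped from 0xFF to 0x00: carry into the previous byte
                  }
                  return false; // all bytes were 0xFF: no further key exists
              }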

          lhofhansl Lars Hofhansl added a comment -

          The actual patch with the test.

          lhofhansl Lars Hofhansl added a comment -

          Could someone have a look? James Taylor, Samarth Jain, Thomas D'Silva?

          lhofhansl Lars Hofhansl added a comment -
          • adds an IT test
          • makes behavior the default, unless RANGE_SCAN is hinted
          lhofhansl Lars Hofhansl added a comment -

          James Taylor where/how is COUNT(DISTINCT ...) compiled?

          I find that COUNT(DISTINCT...) does not benefit from this (takes minutes) whereas SELECT COUNT FROM (SELECT DISTINCT ...) does (takes a few ms).

          Looks like COUNT(DISTINCT...) is compiled to a completely different plan.

          lhofhansl Lars Hofhansl added a comment -

          Some more tests.

          jamestaylor James Taylor added a comment -

          This is awesome, Lars Hofhansl! Thanks so much for working through this. Here's some feedback:

          • In BaseResultIterators when you decide whether or not you can use the distinct filter, you can just check if context.getAggregateManager().isEmpty() as this means there are no aggregate functions in use. No need to loop through all the expressions. We need to go through every row if we're calculating some aggregate across the rows, but otherwise don't (and as you've discovered, a DISTINCT ends up as a GROUP BY).
          • Also, I think it'd be better if OrderPreservingTracker remembered how many columns were traversed over rather than the check you have for cols < plan.getTableRef().getTable().getRowKeySchema().getFieldCount(). You can do that easily by adding a new orderPreservingColumnCount member variable to OrderPreservingTracker and set it at the bottom of isOrderPreserving() to prevPos + prevSlotSpan. Then you can carry it through to the GroupBy object in GroupBy.compile().
          • Add an IT or other test for a group by of an RVC, as that'll demonstrate the need for the above. For example:
            SELECT (pk1,pk2) FROM t GROUP BY (pk1,pk2);
            

            That would have a single expression, but still be optimizable.

          • Minor nit, but why DistinctFixedPrefixFilter as the filter name instead of DistinctPrefixFilter? Why the word "Fixed" there?
          • For the DistinctFixedPrefixFilter.getNextCellHint() method, you only need to call ByteUtil.nextKey() in the fixed length case, as adding the zero byte is essentially getting the next key.
          • For the DistinctFixedPrefixFilter.getNextCellHint() method, I believe you'll need to tweak the logic if the scan is a reverse scan. I think instead of calling nextKey, you'd want to call previousKey, and instead of adding a zero byte, you need BaseScannerRegionObserver.getReversedRow() - move it to ByteUtil instead. I've noticed HBase has the same hacky code in ClientScanner.createClosestRowBefore() - I wish HBase would've handled this differently for reverse scans as it's really not 100% reliable. Meh. Anyway, to detect on the client-side so you can push through the filter if it's a reverse scan, use the ScanUtil.isReversed() method, as we don't set it on the Scan, but do it ourselves in our coprocessor for a variety of reasons. An IT or lower level test would be good for this.
          • So when there's a WHERE clause, everything works out - HBase is pretty smart about that. What about the case where, say, the first row wouldn't match our boolean filter for the where clause, but the second one would? Would the distinct filter jump past it? For example:
            SELECT DISTINCT p1, p2 FROM t WHERE p3 = 'foo'
            

            With rows of p3='bar' followed by p3='foo' and both p1 and p2 being the same for those rows (and all three being pk columns). Would be good to have a test like this - how do you know when you can jump and when you can't, as it seems like it'd rely on what the other filters do.

          lhofhansl Lars Hofhansl added a comment -

          Thanks James Taylor

          -v7 has the following:

          • Renamed to DistinctPrefixFilter. Fixed was relating to the fact that we have a fixed prefix length (in terms of the number of fields)
          • uses context.getAggregateManager().isEmpty()
          • avoids nextKey for variable length trailing prefixes.
          • With "reverse" you mean explicit ORDER BY? Or the key being declared reversed because it was defined such in the table? In either case everything should be fine. In case of ORDER BY it is sorted after the fact. If declared in the table, the key is reversed.
          • The WHERE example works fine, since the distinct filter is last. I.e. the WHERE is evaluated first, then post-filtered by the distinct filter; the distinct filter operates on the filtered values. That actually has the interesting effect that the benefits of the distinct filter are reduced the more selective the WHERE clause is (see the sketch after this list).
          • I don't quite follow the RVC example. Why would that not just work? cols < plan.getTableRef().getTable().getRowKeySchema().getFieldCount() is just an extra optimization to avoid the filter in that case.
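          (A hedged sketch of that filter ordering; whereFilter and distinctFilter are placeholders for the compiled WHERE clause filter and the distinct prefix filter, not Phoenix identifiers:)

              import org.apache.hadoop.hbase.client.Scan;
              import org.apache.hadoop.hbase.filter.Filter;
              import org.apache.hadoop.hbase.filter.FilterList;

              // FilterList evaluates filters in the order they were added, so the
              // distinct filter only sees (and seeks from) rows that passed the WHERE.
              static void installFilters(Scan scan, Filter whereFilter, Filter distinctFilter) {
                  FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
                  filters.addFilter(whereFilter);     // evaluated first
                  filters.addFilter(distinctFilter);  // post-filters the surviving rows
                  scan.setFilter(filters);
              }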
          lhofhansl Lars Hofhansl added a comment -

          Adding an RVC test

          jamestaylor James Taylor added a comment -

          Thanks for the updates, Lars Hofhansl. That's a good trick with the ordering of the filters, and it makes sense that the more selective the filter, the less optimization you'd get.

          With "reverse" you mean explicit ORDER BY

          Yes, like this SELECT prefix1 FROM t GROUP BY prefix1 ORDER BY prefix1 DESC. In this case, the client-side sort is optimized out and a reverse scan will be run instead, so your DistinctPrefixFilter will need to generate the seek hint differently. The way to check on the client is like this (not as I mentioned before): plan.getOrderBy() == OrderBy.REV_ROW_KEY_ORDER_BY

          I don't quite follow the RVC example

          The col variable is used both for the optimization and to determine how many slots to use during the running of the DistinctPrefixFilter. In the RVC example, the cols would turn out to be just 1, since there'll be a single expression (the RVC expression), but it spans two slots. Also, there are other weird cases possible, like this:

          SELECT prefix1 FROM t GROUP BY prefix1, TRUNC(prefix1)
          

          In that case, col would be set to 2, but it really should be 1. By letting OrderPreservingTracker track it for you, you'll handle all the weird cases (both to turn off the usage of the filter and to set the number of slots correctly).

          lhofhansl Lars Hofhansl added a comment -

          Oh I see what you mean. Yeah the optimization would not work on reversed scans.
          Also, I find that I need to do the nextKey even for variable sized fields, because otherwise we'll seek to the current row again.

          New patch in a few.

          lhofhansl Lars Hofhansl added a comment -

          Wow... Reverse scans are slow. With this optimization a reverse DISTINCT or GROUP BY is over 10000x faster.
          And on top of that the default code (without my optimization) returns the wrong result (possibly because of an internal timeout)!

          lhofhansl Lars Hofhansl added a comment -

          -v9

          • re-enables optimization for variable size fields (needed the nextKey, the tests caught it)
          • enables optimization for reverse scans (filters already have a flag for that in HBase - see patch)
          lhofhansl Lars Hofhansl added a comment -

          -v10

          • unit test for reverse scans

          The RVC case is still weird and confusing.

          SELECT DISTINCT (prefix1, prefix2) FROM t or SELECT (prefix1, prefix2) FROM t GROUP BY (prefix1, prefix2) seem to return only a single row in all cases (optimization or not).

          jamestaylor James Taylor added a comment -

          This is looking very good, Lars Hofhansl. A couple of minor items:

          The number of distinct columns needs to be calculated by OrderPreservingTracker instead of assuming that it's the number of group by expressions, as that won't be correct in all cases (for RVCs and for group by expressions that reference the same pk column multiple times). It's a pretty straightforward change, as the value is already being calculated, it's just not remembered and carried through to the GroupBy object. Here's what needs to be done (sketched in code at the end of this comment):

          • In OrderPreservingTracker, add an orderPreservingColumnCount member variable and accessor.
          • In OrderPreservingTracker.isOrderPreserving(), before the return, set orderPreservingColumnCount = prevPos + prevSlotSpan.
          • In GroupByCompiler.GroupBy, add an orderPreservingColumnCount member variable and accessor.
          • In GroupByBuilder add a setOrderPreservingColumnCount() method.
          • Instead of returning this in GroupBy.compile(), return a new GroupBy instance that's a copy of the old one, except call builder.setOrderPreservingColumnCount(tracker.getOrderPreservingColumnCount()) to transfer over the column count (maybe make a GroupByBuilder(GroupBy) constructor). This will carry forward the orderPreservingColumnCount and make it accessible during compilation.
          • In BaseResultIterators, instead of int cols = plan.getGroupBy().getKeyExpressions().size() do int cols = plan.getGroupBy().getOrderPreservingColumnCount()
          • Add DistinctPrefixFilter.getOrderPreservingColumnCount(), and add asserts in planner tests that the value was calculated correctly (since this is calculated at compile time, you don't need an IT test for this). A good couple of tests would be:
            SELECT prefix1 FROM t GROUP BY prefix1, TRUNC(prefix1); // orderPreservingColumnCount should be 1
            SELECT (prefix1, prefix2) FROM t GROUP BY (prefix1, prefix2); // orderPreservingColumnCount should be 2
            

          For the previousKey logic in DistinctPrefixFilter, you need to add trailing 0xFF bytes to ensure you're not skipping back too far. You can just grab BaseScannerRegionObserver.getReversedRow(), rename it to getClosestRowBefore() and move it to ByteUtil, as this does the right thing. FWIW, HBase has similar code in ClientScanner.createClosestRowBefore(). It's ugly and unfortunate, but I'm sure the reverse scan implementors had a good reason to not make the start row exclusive for reverse scans.

          If there are still issues with a group by RVC, please file a separate JIRA.
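          (A hedged sketch of the carry-through described in the list above; everything except orderPreservingColumnCount stands in for the tracker's existing state, and the real classes live in Phoenix's compile package:)

              public class OrderPreservingTrackerSketch {
                  // existing tracker state (stand-ins for illustration)
                  private boolean isOrderPreserving = true;
                  private int prevPos;
                  private int prevSlotSpan;
                  // new: how many leading pk columns the group by expressions cover
                  private int orderPreservingColumnCount;

                  public boolean isOrderPreserving() {
                      // ... existing per-expression tracking has already run here ...
                      orderPreservingColumnCount = prevPos + prevSlotSpan; // remember the count
                      return isOrderPreserving;
                  }

                  public int getOrderPreservingColumnCount() {
                      // carried through GroupBy.compile() and read in BaseResultIterators
                      return orderPreservingColumnCount;
                  }
              }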

          lhofhansl Lars Hofhansl added a comment -

          Thanks for the full description! I'll add that code later today or tomorrow.

          As for the RVCs... When I have a 3-part key (say p1, p2, p3) and do SELECT DISTINCT (p1, p2) FROM t, shouldn't I get more than 1 result back (assuming there are multiple rows for a combination of p1 and p2)? There seems to be something at odds here. This always returns a single result. This happens with or without this optimization. I just do not know how to verify correct behavior, when I do not understand that behavior.

          For the previousKey logic in DistinctPrefixFilter, you need to add trailing 0xFF bytes to ensure you're not skipping back too far.

          The code's doing that: nullPad + previousKey (and still makes one copy of the key only). I think that's all that's needed since we're not looking for a startrow, but rather the next seek backwards. Since scanning inside a row is forward this should all be cool.

          jamestaylor James Taylor added a comment -

          Thanks, Lars Hofhansl. Please file a separate bug for the RVC issue - sounds like something is wrong here unrelated to your work. The following two queries should return the same number of rows:

          SELECT DISTINCT (p1, p2) FROM t GROUP BY (p1, p2)
          SELECT DISTINCT p1, p2 FROM t GROUP BY p1, p2
          

          With the result set you get back for the first query, you'd only be able to select the var binary value through resultSet.getBytes(1), while the second one would have two separate expressions for each column and maintain the type information. It's definitely an edge case.

          The logic for calculating the seek next hint for a reverse scan with a variable length last field isn't quite correct, though it's another edge case. Let's say you have the following four rows: a, a\0xFF, a\0xFF\0xFF, b. If you're at b doing a reverse scan, your seek next hint would be a\0xFF which would skip too far, skipping a\0xFF\0xFF. You have to pad with some number of arbitrary 0xFF bytes (and there'd always be the case of not adding enough, but it's the best we can do). The only time this can happen is if the last field is VARBINARY, as 0xFF isn't a valid byte for VARCHAR or DECIMAL.

          Make sense?

          lhofhansl Lars Hofhansl added a comment -

          Makes sense. Thanks for the case where it would not work. Hmm... So there is no way to be correct in the reverse scenario in all cases.
          I'll look at getReversedRow() again, I think I get now what it's doing.

          I'll file a separate issue for the RVC issue (and not do anything here for that).

          lhofhansl Lars Hofhansl added a comment -

          -v11 fixes the previous key issue.

          Did it slightly differently. I expose the max padding size from ScanUtil and use that to nullPad, then previousKey will automatically do the right thing, with just one copy needed.

          Will now add the correct slot counting and then this should be good to go.

          lhofhansl Lars Hofhansl added a comment -

          -v12

          Does correct column count accounting.
          This should be close to done now.

          jamestaylor James Taylor added a comment -

          +1. Looks very good, Lars Hofhansl. Couple of optional items, completely up to you:

          • Add IT test where a reverse scan will be used (saw some lower level tests, but maybe I missed the IT test for this?)
          • Remove the storing of the orderPreserving boolean and derive it from the orderPreservingColumnCount.
          lhofhansl Lars Hofhansl added a comment -

          Lemme do those two. Was thinking about the orderPreservingColumnCount/orderPreserving as well.
          Will post a final patch and run all tests.

          lhofhansl Lars Hofhansl added a comment -

          Hmm... isOrderPreserving is calculated like this:
          isOrderPreserving &= entry.orderPreserving != OrderPreserving.NO && prevOrderPreserving == OrderPreserving.YES && (pos == prevPos || pos - prevSlotSpan == prevPos || hasEqualityConstraints(prevPos+prevSlotSpan, pos));
          and it is also set in the track(...) method. Not sure I want to mess with that - and convince myself that everything works exactly the same.

          lhofhansl Lars Hofhansl added a comment -

          Turns out there's a problem for fixed length fields when scanning backwards too:
          Say we have rows 1,2,3. Scanning backwards we'll see 3 first. Then we'll seek to 2. That is right before 2 (not after), so the next key seen would be 1 (not 2).
          I think in that case I simply want to 0-pad the seek key. So I would seek to the beginning of row 2, i.e. 2\x00.

          lhofhansl Lars Hofhansl added a comment -

          Nope. The next part will start with something >= 0x00. So need to seek to 2\xff to be safe.
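          (A hedged sketch of the reverse seek hint this converges on - previousKey plus 0xFF padding. The real code builds on ScanUtil/ByteUtil helpers; the method and parameter names here are illustrative:)

              import java.util.Arrays;

              // Build a reverse-scan seek hint: the key just before 'prefix', padded
              // with 0xFF so keys sorting between the hint and 'prefix' aren't skipped.
              static byte[] reverseSeekHint(byte[] prefix, int padLength) {
                  byte[] hint = Arrays.copyOf(prefix, prefix.length + padLength);
                  for (int i = prefix.length - 1; i >= 0; i--) {
                      if (--hint[i] != (byte) 0xFF) break; // decrement with borrow: 3 -> 2
                  }
                  Arrays.fill(hint, prefix.length, hint.length, (byte) 0xFF); // 2 -> 2\xFF\xFF...
                  return hint;
              }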

          lhofhansl Lars Hofhansl added a comment -

          -v13

          • fixes reverse scans for fixed length key parts
          • more tests for reverse scans

          Please have a careful look. This stuff is tricky.
          Should be good to go now.

          lhofhansl Lars Hofhansl added a comment -

          Actually before we pull the trigger I want to add an integration test for variable length fields.
          Will do tomorrow.

          lhofhansl Lars Hofhansl added a comment -

          James Taylor, VARBINARY can only be last in a PK, right? If so, the padding issue above is not actually an issue, since we would only optimize a distinct or a group by if the number of keys used is strictly less than the number of key columns. Do you agree? If so, I can remove the ugly special case code.

          lhofhansl Lars Hofhansl added a comment -

          Pretty happy with -v14

          Just need to know about the VARBINARY issue, and then this is good to go!

          lhofhansl Lars Hofhansl added a comment -

          So specifically:

          You have to pad with some number of arbitrary 0xFF bytes (and there'd always be the case of not adding enough, but it's the best we can do). The only time this can happen is if the last field is VARBINARY, as 0xFF isn't a valid byte for VARCHAR or DECIMAL.

          But since VARBINARY cannot be last in a multipart key and the optimization doesn't happen unless we're using strictly fewer keys than the table's keys, in practice this can never happen.
          Right?

          lhofhansl Lars Hofhansl added a comment -

          "cannot" should "can only"!

          jamestaylor James Taylor added a comment -

          Yes, good point, Lars Hofhansl. The reverse scan case is tricky. Padding with a single 0xFF isn't necessarily enough, as if the next field is a BIGINT, you'd need 8 0xFF bytes, and if the field after that is an INTEGER, you'd need 4 more 0xFF bytes. Since we have the schema information, we can compute the necessary padding in the constructor. You'd just loop through the fields in the RowKeySchema and add the field.getByteSize() while field.isFixedLength() is true. Then add one more to the sum you get and you'll be good in all cases.
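          (A hedged sketch of that computation, written against a plain array of field widths rather than Phoenix's RowKeySchema, with a negative width marking a variable-length field:)

              // Sum the widths of the contiguous fixed-length fields that follow the
              // distinct prefix, then add one. E.g. with BIGINT, INTEGER following
              // the prefix this yields 8 + 4 + 1 = 13 bytes of 0xFF padding.
              static int seekPadding(int[] fieldWidths, int prefixSlots) {
                  int padding = 0;
                  for (int i = prefixSlots; i < fieldWidths.length; i++) {
                      if (fieldWidths[i] < 0) break; // variable-length field: stop summing
                      padding += fieldWidths[i];     // e.g. 8 for BIGINT, 4 for INTEGER
                  }
                  return padding + 1;                // "add one more", per the comment above
              }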

          jamestaylor James Taylor added a comment -

          (and yes, you're correct - don't need to worry about the VARBINARY case for the reasons you mentioned, but would be good to doc that). Just need to calculate the correct padding to use as mentioned above.

          lhofhansl Lars Hofhansl added a comment -

          The reverse scan case is tricky. Padding with a single 0xFF isn't necessarily enough, as if the next field is a BIGINT, you'd need 8 0xFF bytes, and if the field after that is an INTEGER, you'd need 4 more 0xFF bytes.

          I don't think that is needed. I ran through this with the debugger. Let's say we have a two part prefix, INT/INT, and let's say the values are 1/1, 2/1, and 3/1.
          In the reverse case we'll see 3/1 first. So we'll seek to 2\0xFF. It does not matter what follows, since we're guaranteed to see at least 2 as the next first part of the key. After that we'll seek to 1\0xFF. This seems to be correct in all cases.

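          A sketch of that hint construction as debated here (a corner case found further down shows it needs refinement); the helper names and method shapes are illustrative, not Phoenix's actual code:

              // Build the reverse seek hint: decrement the fixed-width prefix
              // by one and append 0xFF, e.g. prefix 3 -> hint 2\xFF.
              private byte[] reverseHint(byte[] prefix) {
                  byte[] hint = java.util.Arrays.copyOf(decrement(prefix), prefix.length + 1);
                  hint[prefix.length] = (byte) 0xFF;
                  return hint;
              }

              // Unsigned decrement with borrow; assumes the prefix is not all zeros.
              private byte[] decrement(byte[] b) {
                  byte[] d = b.clone();
                  for (int i = d.length - 1; i >= 0; i--) {
                      if (d[i] != 0) { d[i]--; break; }
                      d[i] = (byte) 0xFF; // borrow
                  }
                  return d;
              }
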
          jamestaylor James Taylor added a comment -

          What if you have these three rows (assume an INT plus a 2-byte SMALLINT for the schema): 1/0xFF0x00, 1/0xFF0xFF, 2

          Would the seek-next hint skip the second row? The fixed-length types can contain any bytes, including 0xFF.

          jamestaylor James Taylor added a comment -

          I think what we need is an int[] indexed by slot position to get the correct padding, because the padding necessary depends on the slot position. For example, with a schema like VARCHAR, INT, INT, we need to pad for the 2nd and 3rd slot positions. We don't need to consider the 1st position, because we'll always have an actual key to base our seek-next hint on. For each subsequent slot position, we need to know the length of the contiguous run of fixed-length fields from that point forward.

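          A sketch of that per-slot table, under the same assumed schema accessors as the loop above; it would be computed once, e.g. in the filter's constructor:

              // padding[slot] = total width of the contiguous run of
              // fixed-length fields starting at that slot (0 where the
              // field at the slot is variable-length).
              private int[] buildPaddingBySlot(RowKeySchema schema) {
                  int n = schema.getFieldCount();
                  int[] padding = new int[n];
                  for (int slot = n - 1; slot >= 0; slot--) {
                      Field field = schema.getField(slot);
                      if (field.isFixedLength()) {
                          padding[slot] = field.getByteSize()
                                  + (slot + 1 < n ? padding[slot + 1] : 0);
                      }
                  }
                  return padding;
              }
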
          lhofhansl Lars Hofhansl added a comment -

          What if you have these three rows (assume INT, 2 byte SMALLINT for schema): 1/0xFF0x00, 1/0xFF0xFF, 2

          You mean a distinct/groupby on the first part? We'd want it to skip the 2nd row in that case, and I can't think of a case where that would go wrong.
          We'd see and output 2, then seek to 1/0xFF. Next we'll see 1/0xFF0xFF and output 1, then seek to 0/0xFF.
          Because of the extra padding we'd never miss anything we need to see. And it doesn't matter what the key part is; as long as we seek to just past the next key (in the reversed case) we're good. That only works because the size is fixed.

          lhofhansl Lars Hofhansl added a comment -

          For example, if we have a schema like VARCHAR, INT, INT, we need to pad for the 2nd and 3rd slot position

          I think we only need to pad whatever the last distinct/groupby position is. So in the schema above, if the distinct is on part 1 we'd pad the VARCHAR, and if the distinct is on parts 1 and 2 we'd pad the INT if needed. (We wouldn't optimize this for a distinct over all parts.)
          Pretty sure this works correctly in all cases now.

          lhofhansl Lars Hofhansl added a comment -

          -v15 removes the VARBINARY logic from DistinctPrefixFilter.

          Should be good to go... unless James convinces me that the fixed-length reverse logic is incorrect.

          lhofhansl Lars Hofhansl added a comment -

          Just fixes a spelling mistake in the comments.

          lhofhansl Lars Hofhansl added a comment -

          Actually, here's a case that does not work: 1, 2/0xff0xff, 3, 3. When we see the 2nd 3, we'll seek to 2/0xff, which will miss the 2/0xff0xff row.
          Fix coming.

          lhofhansl Lars Hofhansl added a comment -

          -v16 fixes the corner case - and makes it simpler!

          lhofhansl Lars Hofhansl added a comment -

          BTW... Here's the HBase logic in FilterList to AND filters together:

          ...
                  if (filter.filterAllRemaining()) {
                    return ReturnCode.NEXT_ROW;
                  }
                  ReturnCode code = filter.filterKeyValue(v);
                  switch (code) {
                  // Override INCLUDE and continue to evaluate.
                  case INCLUDE_AND_NEXT_COL:
                    rc = ReturnCode.INCLUDE_AND_NEXT_COL;
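                  // deliberate fall through: INCLUDE_AND_NEXT_COL is also INCLUDEd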
                  case INCLUDE:
                    transformed = filter.transformCell(transformed);
                    continue;
                  case SEEK_NEXT_USING_HINT:
                    seekHintFilter = filter;
                    return code;
                  default:
                    return code;
                  }
          ...
          return rc;
          

          So the FilterList combines the filters correctly as long as they are ordered correctly. For example, a SKIP from any filter causes the FilterList to SKIP; a later filter can only cause a SEEK if all earlier filters returned INCLUDE*.

          Unless I hear objections I am going to commit this to 4.x tonight or tomorrow.

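          For reference, wiring the filters together in the required order looks roughly like this (FilterList, MUST_PASS_ALL, and FirstKeyOnlyFilter are standard HBase API; distinctPrefixFilter stands in for the new filter):

              Scan scan = new Scan();
              // Order matters: the purely INCLUDE-ing filter goes first, so
              // the later filter is still allowed to answer SEEK_NEXT_USING_HINT.
              scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ALL,
                      new FirstKeyOnlyFilter(),
                      distinctPrefixFilter));
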
          lhofhansl Lars Hofhansl added a comment -

          I think what we need is an int[] indexed by the slot position to get the correct padding to use because the padding necessary depends on the slot position.

          The trick for the reverse case is to seek to the beginning of the first row with a given prefix. I realized that one can simply seek to the current row key prefix itself. That places the "cursor" right before the first occurrence of this prefix, which in the reverse case is exactly what we want. No padding or meddling with the key at all!

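          A sketch of that trick; KeyValueUtil.createFirstOnRow is a real HBase helper, while the method around it is illustrative:

              // Seek to the bare prefix. A reverse scanner positioned on this
              // hint next returns the largest key strictly below the prefix,
              // i.e. the last row of the previous distinct group.
              private Cell reverseSeekHint(byte[] row, int prefixLength) {
                  return KeyValueUtil.createFirstOnRow(java.util.Arrays.copyOf(row, prefixLength));
              }
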
          jamestaylor James Taylor added a comment -

          +1. Awesome job, Lars Hofhansl. I love the trick you thought of for the reverse scan case!

          lhofhansl Lars Hofhansl added a comment -

          -v17 is what I am going to commit:

          • fixes some comments
          • removes the unnecessary changes to ScanUtil
          • adds a version to DistinctPrefixFilter that is written to its Writable form... Not needed now, but good form to make it future-proof
          lhofhansl Lars Hofhansl added a comment -

          Pushed to 4.x-HBase-0.98, 4.x-HBase-1.0, 4.x-HBase-1.1, and master.

          (I do see a 4.x-HBase-1.x branch, but it seems to be dysfunctional.)

          hudson Hudson added a comment -

          FAILURE: Integrated in Phoenix-master #1241 (See https://builds.apache.org/job/Phoenix-master/1241/)
          PHOENIX-258 Use skip scan when SELECT DISTINCT on leading row key (larsh: rev eb275a5c6089b49bfa5263fe8a34e1a5e46bae97)

          • phoenix-core/src/test/java/org/apache/phoenix/filter/DistinctPrefixFilterTest.java
          • phoenix-core/src/main/java/org/apache/phoenix/filter/DistinctPrefixFilter.java
          • phoenix-core/src/main/java/org/apache/phoenix/iterate/BaseResultIterators.java
          • phoenix-core/src/it/java/org/apache/phoenix/end2end/DistinctPrefixFilterIT.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/OrderPreservingTracker.java
          • phoenix-core/src/main/java/org/apache/phoenix/compile/GroupByCompiler.java
          • phoenix-core/src/main/java/org/apache/phoenix/iterate/ExplainTable.java
          lhofhansl Lars Hofhansl added a comment -

          There are some test failures. Fixes coming soon.

          lhofhansl Lars Hofhansl added a comment -

          Done.

          I might have inadvertently created a 4.0 and 4.x-HBase-1.x branch (seems I had these locally, but they didn't exist remotely)...? If those should not exist I'll remove them.

          elserj Josh Elser added a comment -

          I might have inadvertently created a 4.0 and 4.x-HBase-1.x branch (seems I had these locally, but they didn't exist remotely)...? If those should not exist I'll remove them.

          Yeah, I just saw those two new branches:

          remote: Counting objects: 761, done.
          remote: Compressing objects: 100% (144/144), done.
          remote: Total 304 (delta 118), reused 285 (delta 107)
          Receiving objects: 100% (304/304), 32.24 KiB | 0 bytes/s, done.
          Resolving deltas: 100% (118/118), completed with 63 local objects.
          From https://git-wip-us.apache.org/repos/asf/phoenix
           * [new branch]      4.0        -> origin/4.0
             0032a3d..da39bd4  4.x-HBase-0.98 -> origin/4.x-HBase-0.98
             dfa3eec..65ee886  4.x-HBase-1.0 -> origin/4.x-HBase-1.0
             7b25276..cf551d5  4.x-HBase-1.1 -> origin/4.x-HBase-1.1
           * [new branch]      4.x-HBase-1.x -> origin/4.x-HBase-1.x
             700a941..826f0af  calcite    -> origin/calcite
             eb275a5..7dcf95a  master     -> origin/master
          

          I don't think they should be there. Just the 4.x-HBase-{0.98,1.0,1.1} and master branches.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Phoenix-master #1242 (See https://builds.apache.org/job/Phoenix-master/1242/)
          PHOENIX-258 minor test fixes. (larsh: rev 7dcf95a40063a25917a68c56c68fe61a11a4ef8b)

          • phoenix-core/src/it/java/org/apache/phoenix/end2end/SortMergeJoinMoreIT.java
          • phoenix-core/src/it/java/org/apache/phoenix/end2end/index/IndexExpressionIT.java
          lhofhansl Lars Hofhansl added a comment -

          There are some more tests failing, all needing test fixes, I think, due to the changed explain plans... I'll be studying those later today.

          lhofhansl Lars Hofhansl added a comment -

          Deleted the two branches. Sorry about this. I think I had them locally for some reason; after I cherry-picked the changes for this fix into all the other branches, I did a push --all and that created the remote branches... Anyway, let me know if this is not fixed yet.

          lhofhansl Lars Hofhansl added a comment -

          Actually it's all good. master and -1.0 are back to normal. 1.1 failed due to a transient mini-cluster issue.

          hudson Hudson added a comment -

          FAILURE: Integrated in Phoenix-master #1254 (See https://builds.apache.org/job/Phoenix-master/1254/)
          PHOENIX-2986 Some queries are misoptimized by PHOENIX-258. (larsh: rev af51d5607121c4924eb143c8eeb8b527581f9beb)

          • phoenix-core/src/it/java/org/apache/phoenix/end2end/DistinctPrefixFilterIT.java
          • phoenix-core/src/main/java/org/apache/phoenix/iterate/BaseResultIterators.java
          mujtabachohan Mujtaba Chohan added a comment -

          Lars Hofhansl Tested on a table with 300M rows (230 GB), split into 78 regions on a 16-node cluster. SELECT DISTINCT ORGANIZATION_ID FROM T, with 2 distinct values for the leading row key column, takes 90 seconds, compared to 110 seconds for a full-scan count(*) aggregation.

          Explain for select distinct:

          CLIENT 78-CHUNK PARALLEL 78-WAY FULL SCAN OVER T
          SERVER FILTER BY FIRST KEY ONLY
          SERVER DISTINCT PREFIX FILTER OVER ORGANIZATION_ID
          SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ORGANIZATION_ID
          CLIENT MERGE SORT
          jamestaylor James Taylor added a comment -

          Lars Hofhansl - For only two distinct values I would have expected a bigger gain than 110 seconds down to 90. What about you?

          lhofhansl Lars Hofhansl added a comment -

          That does not at all match my experience. It only needs to do 2 (perhaps 3) seeks per chunk, so at most 234 seeks across the 78 chunks, and those happen in parallel. That should take a few milliseconds.

          OK... It also needs to seek within each HFile/Memstore. Was the table just seeded? (Even then I'd expect maybe a few thousand seeks, still in parallel.)

          If there are many deleted rows, that might overshadow the savings here, but I doubt that's the case.

          How long does SELECT /*+ RANGE_SCAN */ DISTINCT ORGANIZATION_ID FROM T take?

          ankit@apache.org Ankit Singhal added a comment -

          Bulk close of all issues that have been resolved in a released version.

          lhofhansl Lars Hofhansl added a comment -

          Chatted with James Taylor and Mujtaba Chohan. Turns out there was a lot of deleted data in that table, and skipping over it accounted for the 90s. The time to run the DISTINCT went to < 1s.


            People

            • Assignee: lhofhansl Lars Hofhansl
            • Reporter: ryang-sfdc
            • Votes: 0
            • Watchers: 10
