Details

    • Type: Brainstorming
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      In HBASE-9778 I basically punted the decision on whether doing repeated scanner.next() called instead of the issueing (re)seeks to the user.
      I think we can do better.

      One way do that is maintain simple stats of what the maximum number of versions we've seen for any row/col combination and store these in the HFile's metadata (just like the timerange, oldest Put, etc).

      Then we estimate fairly accurately whether we have to expect lots of versions (i.e. seek between columns is better) or not (in which case we'd issue repeated next()'s).

      1. 12311.txt
        40 kB
        Lars Hofhansl
      2. 12311-indexed-0.98.txt
        7 kB
        Lars Hofhansl
      3. 12311-indexed-0.98-v2.txt
        19 kB
        Lars Hofhansl
      4. 12311-v2.txt
        67 kB
        Lars Hofhansl
      5. 12311-v3.txt
        72 kB
        Lars Hofhansl
      6. CellStatTracker.java
        2 kB
        Lars Hofhansl

        Activity

        Hide
        lhofhansl Lars Hofhansl added a comment -

        Are there better/other ideas?
        The core issue we want to get to: If a (re)seek would get us to a different HFile block we should seek, otherwise repeated next()'s inside the block is (far) more efficient. This decision would need to be made efficient - without needing a bunch of compares of the row-key.

        Tracking the maximum number of versions seen would a possible and relatively cheap proxy to guess the likelihood of a seek getting us out of the current block.

        Show
        lhofhansl Lars Hofhansl added a comment - Are there better/other ideas? The core issue we want to get to: If a (re)seek would get us to a different HFile block we should seek, otherwise repeated next()'s inside the block is (far) more efficient. This decision would need to be made efficient - without needing a bunch of compares of the row-key. Tracking the maximum number of versions seen would a possible and relatively cheap proxy to guess the likelihood of a seek getting us out of the current block.
        Hide
        stack stack added a comment -

        Would metadata on block itself be OTT? Or metadata in the index? A few ints on cardinality... avg sizes?

        Otherwise, adding to the hfile metadata seems way to go. Write a pb so can be expanded as we want to record more?

        Show
        stack stack added a comment - Would metadata on block itself be OTT? Or metadata in the index? A few ints on cardinality... avg sizes? Otherwise, adding to the hfile metadata seems way to go. Write a pb so can be expanded as we want to record more?
        Hide
        lhofhansl Lars Hofhansl added a comment -

        I would need to get it by columnFamily/HStore (because I want to avoid the comparisons in SQM and down), on the blocks itself that'd be hard to do. It would be similar to maxSequenceId, etc. Just another things in the HFile trailer.

        Still wondering whether there's a better way, though. It would be nice if the upper layers could suggest a (re)seek and then at the Store (or maybe even StoreFile) we could decide how execute that, problem there would be to avoid multiple compares between the layers (which is the main reason why seeks are so expensive even when the blocks are already in the cache)

        Show
        lhofhansl Lars Hofhansl added a comment - I would need to get it by columnFamily/HStore (because I want to avoid the comparisons in SQM and down), on the blocks itself that'd be hard to do. It would be similar to maxSequenceId, etc. Just another things in the HFile trailer. Still wondering whether there's a better way, though. It would be nice if the upper layers could suggest a (re)seek and then at the Store (or maybe even StoreFile) we could decide how execute that, problem there would be to avoid multiple compares between the layers (which is the main reason why seeks are so expensive even when the blocks are already in the cache)
        Hide
        ndimiduk Nick Dimiduk added a comment -

        Why is it difficult to tag blocks with metadata? We've extended the block header before.

        Show
        ndimiduk Nick Dimiduk added a comment - Why is it difficult to tag blocks with metadata? We've extended the block header before.
        Hide
        lhofhansl Lars Hofhansl added a comment -

        It's not hard to record that information per block. It's hard to get that information out to where/when I need it.

        In the SQM I want to know how many versions (and columns actually) I can expect worst case. If it's maintained per block that has to be aggregated somewhere. If it is per store file it is easily aggregated per store.

        Show
        lhofhansl Lars Hofhansl added a comment - It's not hard to record that information per block. It's hard to get that information out to where/when I need it. In the SQM I want to know how many versions (and columns actually) I can expect worst case. If it's maintained per block that has to be aggregated somewhere. If it is per store file it is easily aggregated per store.
        Hide
        ndimiduk Nick Dimiduk added a comment -

        I see. Would be good to have more than just version stats – counts and value size distributions come to mind.

        Show
        ndimiduk Nick Dimiduk added a comment - I see. Would be good to have more than just version stats – counts and value size distributions come to mind.
        Hide
        lhofhansl Lars Hofhansl added a comment -

        Agreed. Maybe it's time to formalize/rationalize all the stats/metadata we keep around.

        Show
        lhofhansl Lars Hofhansl added a comment - Agreed. Maybe it's time to formalize/rationalize all the stats/metadata we keep around.
        Hide
        ndimiduk Nick Dimiduk added a comment -

        Indeed. It bothers me that Phoenix needs to maintain its own stats table.

        Show
        ndimiduk Nick Dimiduk added a comment - Indeed. It bothers me that Phoenix needs to maintain its own stats table.
        Hide
        apurtell Andrew Purtell added a comment -

        Well we had HBASE-7958 but it fizzled out. One issue seemed to be maintaining a stats table duplicates metrics reporting and metrics aggregation/history that will already be in place externally . So I proposed https://issues.apache.org/jira/browse/HBASE-7958?focusedCommentId=13997314&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13997314 not surfacing stats calculated when processing HFiles into a system table but instead keep them as internal metadata that HFile/HStore could get at. The proposal was "maintain a tree of statistic files in HDFS" but this information could be embedded in HFiles themselves. The information there could also be exported to the metrics subsystem. Should we revive that issue? Although per block HFile statistics is something new I think.

        Show
        apurtell Andrew Purtell added a comment - Well we had HBASE-7958 but it fizzled out. One issue seemed to be maintaining a stats table duplicates metrics reporting and metrics aggregation/history that will already be in place externally . So I proposed https://issues.apache.org/jira/browse/HBASE-7958?focusedCommentId=13997314&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13997314 not surfacing stats calculated when processing HFiles into a system table but instead keep them as internal metadata that HFile/HStore could get at. The proposal was "maintain a tree of statistic files in HDFS" but this information could be embedded in HFiles themselves. The information there could also be exported to the metrics subsystem. Should we revive that issue? Although per block HFile statistics is something new I think.
        Hide
        lhofhansl Lars Hofhansl added a comment -

        Andrew Purtell, I agree. These are per HFile stats that we should keep them there. This is markedly different from the type of stats that - for example - Phoenix keeps around and from that angle is not wrong to have Phoenix keep these metrics (and HBase's job is to provide the necessary plumbing to do that, as we do with coprocessors)

        In any case I can see this going both ways.

        Show
        lhofhansl Lars Hofhansl added a comment - Andrew Purtell , I agree. These are per HFile stats that we should keep them there. This is markedly different from the type of stats that - for example - Phoenix keeps around and from that angle is not wrong to have Phoenix keep these metrics (and HBase's job is to provide the necessary plumbing to do that, as we do with coprocessors) In any case I can see this going both ways.
        Hide
        lhofhansl Lars Hofhansl added a comment -

        In the end, I'm mostly looking for alternatives to speed up (re)seek. The core of the problem is that we always seek even if the seek will land us on the same block and hence the seek is actually unnecessary. There is a check for that (see HFileReaderV2.reseekTo), but that is too far down and we still need to recompare at the Store/SQM level.

        Having some simple stats about number of versions per column and/or columns per row would be a reasonable proxy. Ideally we'd need a need good guess on whether the seek would seek us out of the current bock with high likelihood. If so we seek if not we do next() a few time.

        So maybe a better stat would the number of bytes per column and per row. I.e. we'd sum up the sizes of the KVs of version for a column and all the KVs for a row and than keep max/avg stats about that. Then with knowing the HFile block size we can guess whether a seek to a column or row would propel us out of the block or not.

        With that is mind the metrics to keep would be something like avg/max COL_SIZE and avg/max ROW_SIZE. The former would inform whether we should SEEK_NEXT_COL the latter whether we should SEEK_NEXT_ROW - both instead of performing a series of SKIP.

        Presumably for most use cases the distribution of the col and row size would be pretty similar across the entire table (I know we have a bunch of such use cases).

        Show
        lhofhansl Lars Hofhansl added a comment - In the end, I'm mostly looking for alternatives to speed up (re)seek. The core of the problem is that we always seek even if the seek will land us on the same block and hence the seek is actually unnecessary. There is a check for that (see HFileReaderV2.reseekTo), but that is too far down and we still need to recompare at the Store/SQM level. Having some simple stats about number of versions per column and/or columns per row would be a reasonable proxy. Ideally we'd need a need good guess on whether the seek would seek us out of the current bock with high likelihood. If so we seek if not we do next() a few time. So maybe a better stat would the number of bytes per column and per row. I.e. we'd sum up the sizes of the KVs of version for a column and all the KVs for a row and than keep max/avg stats about that. Then with knowing the HFile block size we can guess whether a seek to a column or row would propel us out of the block or not. With that is mind the metrics to keep would be something like avg/max COL_SIZE and avg/max ROW_SIZE. The former would inform whether we should SEEK_NEXT_COL the latter whether we should SEEK_NEXT_ROW - both instead of performing a series of SKIP. Presumably for most use cases the distribution of the col and row size would be pretty similar across the entire table (I know we have a bunch of such use cases).
        Hide
        lhofhansl Lars Hofhansl added a comment -

        A tracker like this (to be hooked up in StoreFile.Writer).
        Quite simply... Just want to park it here. Had 15mins to spare on this.
        Need to add protobuf, etc.

        Show
        lhofhansl Lars Hofhansl added a comment - A tracker like this (to be hooked up in StoreFile.Writer). Quite simply... Just want to park it here. Had 15mins to spare on this. Need to add protobuf, etc.
        Hide
        lhofhansl Lars Hofhansl added a comment -

        W.I.P. patch. Just parking it. Does not work.
        Need to find a good way of passing this from Store to SQM. Could use ScanInfo, but that's used in some coprocs - then again it's marked with InterfaceAudience.Private.

        Show
        lhofhansl Lars Hofhansl added a comment - W.I.P. patch. Just parking it. Does not work. Need to find a good way of passing this from Store to SQM. Could use ScanInfo, but that's used in some coprocs - then again it's marked with InterfaceAudience.Private.
        Hide
        lhofhansl Lars Hofhansl added a comment -

        The logic would something like this:

        • replace SEEK_COL with SKIP when when the mean size of all columns in <= 1/2 HBlock and standard deviation is < 1 HBlock and the max size of any column is < 4 HFileBlocks.

        same for row:

        • replace SEEK_ROW when when the mean size of all rows in <= 1/2 HBlock and standard deviation is < 1 HBlocks and the max size of any row is < 4 HBlocks

        (I might not do the standard deviation part, not sure it's really needed)

        So we'll avoid using SKIPs when a SEEK will with some probably land outside of the current block. If that is the case we'll use SEEK_COL, SEEK_ROW as before. SEEK_WITH_HINT would always be executed as SEEK_WITH_HINT.

        Show
        lhofhansl Lars Hofhansl added a comment - The logic would something like this: replace SEEK_COL with SKIP when when the mean size of all columns in <= 1/2 HBlock and standard deviation is < 1 HBlock and the max size of any column is < 4 HFileBlocks. same for row: replace SEEK_ROW when when the mean size of all rows in <= 1/2 HBlock and standard deviation is < 1 HBlocks and the max size of any row is < 4 HBlocks (I might not do the standard deviation part, not sure it's really needed) So we'll avoid using SKIPs when a SEEK will with some probably land outside of the current block. If that is the case we'll use SEEK_COL, SEEK_ROW as before. SEEK_WITH_HINT would always be executed as SEEK_WITH_HINT.
        Hide
        lhofhansl Lars Hofhansl added a comment -

        Got another few minutes on this. Bit further now. Still not working at all. Again just want to park it.

        • collects standard deviation as well
        • knows how to combine trackers from multiple files - including the standard deviation correctly

        The only part left, really, is to hook this up in SQM.

        Once I get this working, I'll do some number and see whether this effort is worth it.

        Show
        lhofhansl Lars Hofhansl added a comment - Got another few minutes on this. Bit further now. Still not working at all. Again just want to park it. collects standard deviation as well knows how to combine trackers from multiple files - including the standard deviation correctly The only part left, really, is to hook this up in SQM. Once I get this working, I'll do some number and see whether this effort is worth it.
        Hide
        lhofhansl Lars Hofhansl added a comment -

        Everything hooked up. Not pretty really. Better be a good perf improvement, will test when I get to this next.

        Show
        lhofhansl Lars Hofhansl added a comment - Everything hooked up. Not pretty really. Better be a good perf improvement, will test when I get to this next.
        Hide
        lhofhansl Lars Hofhansl added a comment - - edited

        I've thought of another approach. StorefileScanners have the notion of a "next indexed key", that is next known key to seek to (i.e. beginning of a block). What if we took the next indexed key of the scanner that is on top of the (StoreFileScanner/MemstoreScanner) heap and only issue a seek if we would seek past that key? It's only a heuristic and that check would not come free, but assuming it likely that chunks of Cells will come from the same file, we'd have a fairly good indicator whether the seek will help. I have a 0.98 patch for that, and it improves things. As an example I've used a scan with a timerange. If the range is before all Cells (except one so that the files isn't ruled out) it's takes about 3.1s (we SKIP in that case) if the timerange falls after all Cells (again except one) it's 10.2s (we're seeking this time - see SQM.match).

        With the patch the first case is unchanged (3.1s), but the 2nd case it reduced to 4.5s, since can avoid the unnecessary seek in many cases.

        Show
        lhofhansl Lars Hofhansl added a comment - - edited I've thought of another approach. StorefileScanners have the notion of a "next indexed key", that is next known key to seek to (i.e. beginning of a block). What if we took the next indexed key of the scanner that is on top of the (StoreFileScanner/MemstoreScanner) heap and only issue a seek if we would seek past that key? It's only a heuristic and that check would not come free, but assuming it likely that chunks of Cells will come from the same file, we'd have a fairly good indicator whether the seek will help. I have a 0.98 patch for that, and it improves things. As an example I've used a scan with a timerange. If the range is before all Cells (except one so that the files isn't ruled out) it's takes about 3.1s (we SKIP in that case) if the timerange falls after all Cells (again except one) it's 10.2s (we're seeking this time - see SQM.match). With the patch the first case is unchanged (3.1s), but the 2nd case it reduced to 4.5s, since can avoid the unnecessary seek in many cases.
        Hide
        lhofhansl Lars Hofhansl added a comment -

        Here's a patch that illustrates the idea for 0.98. In store scanner when the SQM indicated we should seek, we check the nextIndexedKey (if available) and we would seek before that, we simply SKIP and let the SQM try again.

        The only annoying part is that we only an indexed key (i.e. row, family, column), which we are trying to get rid of. HFileReaderV2.AbstractScannerV2.reseekTo performs the same check to decide whether to seek or to retry on the same block, this just pulls the check up. We can probably remove that optimization from the AbstractScannerV2 now (and save a few more compares).

        Show
        lhofhansl Lars Hofhansl added a comment - Here's a patch that illustrates the idea for 0.98. In store scanner when the SQM indicated we should seek, we check the nextIndexedKey (if available) and we would seek before that, we simply SKIP and let the SQM try again. The only annoying part is that we only an indexed key (i.e. row, family, column), which we are trying to get rid of. HFileReaderV2.AbstractScannerV2.reseekTo performs the same check to decide whether to seek or to retry on the same block, this just pulls the check up. We can probably remove that optimization from the AbstractScannerV2 now (and save a few more compares).
        Hide
        lhofhansl Lars Hofhansl added a comment -

        Tested some more with KVs with 10, 100, and 10000 versions. The more version we have the more likely it becomes that we would seek past the next indexed key of the top scanner in the heap and hence keep the seek. So this appears to work fine with few and many versions, as well as few and many columns, and all cases it will estimate whether a SKIP or a SEEK is better.

        Show
        lhofhansl Lars Hofhansl added a comment - Tested some more with KVs with 10, 100, and 10000 versions. The more version we have the more likely it becomes that we would seek past the next indexed key of the top scanner in the heap and hence keep the seek. So this appears to work fine with few and many versions, as well as few and many columns, and all cases it will estimate whether a SKIP or a SEEK is better.
        Hide
        lhofhansl Lars Hofhansl added a comment -

        I might file a separate issue for this.

        Show
        lhofhansl Lars Hofhansl added a comment - I might file a separate issue for this.
        Hide
        lhofhansl Lars Hofhansl added a comment -

        Can now remove the lookahead hack I put in before.
        (Will test more, to make sure this is true in all cases)

        Show
        lhofhansl Lars Hofhansl added a comment - Can now remove the lookahead hack I put in before. (Will test more, to make sure this is true in all cases)
        Hide
        lhofhansl Lars Hofhansl added a comment -

        To be continued on HBASE-13109.

        Show
        lhofhansl Lars Hofhansl added a comment - To be continued on HBASE-13109 .
        Hide
        lhofhansl Lars Hofhansl added a comment -

        This is no longer needed. I added much better heuristics now to decide when we should SEEK and when we should SKIP.

        Show
        lhofhansl Lars Hofhansl added a comment - This is no longer needed. I added much better heuristics now to decide when we should SEEK and when we should SKIP.

          People

          • Assignee:
            Unassigned
            Reporter:
            lhofhansl Lars Hofhansl
          • Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development