Details

    • Type: Task
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.90.4
    • Fix Version/s: 0.94.0
    • Component/s: None
    • Labels: None
    • Tags: noob

      Description

      Dave Revell was asking on IRC today if there's a way to scan ranges of qualifiers within a row. That is, to be able to specify a start qualifier and an end qualifier so that the Get or Scan seeks directly to the first qualifier and stops at some point, which can be predetermined by a qualifier or simply by a batch configuration (which already exists).

      This is particularly useful for large rows with time-based qualifiers.

      Dave also mentioned that another popular database has such a feature that they call "column slices".
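
As a rough illustration of the column-slice idea described above, a single row can be modelled as a sorted map of qualifiers. This is a toy model and not HBase code: the class `ColumnSliceSketch`, the method `slice`, the `batch` parameter, and the use of strings in place of `byte[]` qualifiers are all assumptions made for the sketch.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model only: one row whose qualifiers are kept in sorted order.
// Strings stand in for HBase's byte[] qualifiers (an assumption).
class ColumnSliceSketch {

    // Returns the cells whose qualifier lies in [startQualifier, stopQualifier),
    // reading at most `batch` of them.
    static SortedMap<String, String> slice(NavigableMap<String, String> row,
                                           String startQualifier,
                                           String stopQualifier,
                                           int batch) {
        SortedMap<String, String> result = new TreeMap<>();
        // subMap positions directly at the first qualifier >= startQualifier,
        // so earlier cells in the row are never visited.
        for (Map.Entry<String, String> cell
                : row.subMap(startQualifier, true, stopQualifier, false).entrySet()) {
            if (result.size() >= batch) {
                break; // the batch limit ends the slice early
            }
            result.put(cell.getKey(), cell.getValue());
        }
        return result;
    }
}
```

With time-based qualifiers such as `t0001` to `t0004`, calling `slice(row, "t0002", "t0004", 10)` on this model returns the cells for `t0002` and `t0003`.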

          Activity

          stack stack added a comment -

          Should include cf too? So we have setStartRow and setStopRow. We could add setStartColumnFamily, setStartQualifier and setStartTS... but that gets ugly quick. setStart(row, cf, qualifier, ts) and setStop?

          jdcryans Jean-Daniel Cryans added a comment -

          Yeah well I wonder if one would even need that sort of facility to be cross-families. Maybe Dave would like to chime in?

          dave_revell Dave Revell added a comment -

          I don't have any strong feelings about whether this should be cross-column-families or not. I'll leave that question to wiser HBase folks than me.

          stack stack added a comment -

          @J-D On cf, I'd say might as well though I see x-cf not so important. Should be able to spec a timerange though. I see that we can pass one now but we'd need to jigger it so it applied over the qualifier range or if only the one qualifier specified, w/i that qualifier; e.g. imagine a metric w/ 1M versions... you should be able to spec scanning 10k at a time.

          dave_revell Dave Revell added a comment -

          Assigned to myself, I'm going to give this a try. I'll try not to noob too badly.

          dave_revell Dave Revell added a comment -

          I've started work on this and I'm having trouble deciding on the best interface. I can see a couple ways that it might work:

          1. Add Scan.setStartQualifier() and Scan.setEndQualifier(). What if the user also uses addColumn(), would we fetch those columns also?

          2. Add Scan.addColumnRange(byte[] start, byte[] end). Could the user specify multiple (possibly overlapping) ranges? Again, would this conflict with addColumn()?

          Any input would be appreciated.

          stack stack added a comment -

          On 1. I think it depends.

          addColumn can specify a (a) cf only, (b) a cf+qualifier, or a list of (c) cf+qualifiers.

          If (a), then the setStartQualifier/setEndQualifier would return items from within a CF between the specified qualifiers (do you need an 'inclusive/exclusive' flag or are you just going to have them both be inclusive of the specified qualifier?)

          On (b), i'd think that we'd just return the qualifier only IFF it was within the setStartQualifier/setEndQualifier bounds.

          For (c), we'd return members of the list that are inside the setStartQualifier/setEndQualifier bounds.

          I think I prefer 2. api to 1. api. It maps well to TimeRange. It's a bit more of a pain specifying 'start x to the end of the row' with api 2. but it's palatable enough.

          Ditto on how api 2. plays with addColumn.
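
Cases (b) and (c) above amount to intersecting the explicitly requested qualifiers with the qualifier bounds. A minimal sketch of that rule follows. Everything in it is hypothetical: `QualifierBoundsSketch`, `inBounds`, `restrict`, and the two inclusive flags do not exist in HBase, and strings stand in for `byte[]` qualifiers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of how explicitly added columns could be narrowed
// by start/end qualifier bounds. Not an HBase API.
class QualifierBoundsSketch {

    // True if `qualifier` falls between start and stop, honouring the
    // inclusive/exclusive flag on each end.
    static boolean inBounds(String qualifier,
                            String start, boolean startInclusive,
                            String stop, boolean stopInclusive) {
        int lo = qualifier.compareTo(start);
        int hi = qualifier.compareTo(stop);
        boolean aboveStart = startInclusive ? lo >= 0 : lo > 0;
        boolean belowStop = stopInclusive ? hi <= 0 : hi < 0;
        return aboveStart && belowStop;
    }

    // Keeps only the requested qualifiers that lie inside the bounds,
    // preserving the order in which they were requested.
    static List<String> restrict(List<String> requested,
                                 String start, boolean startInclusive,
                                 String stop, boolean stopInclusive) {
        List<String> kept = new ArrayList<>();
        for (String qualifier : requested) {
            if (inBounds(qualifier, start, startInclusive, stop, stopInclusive)) {
                kept.add(qualifier);
            }
        }
        return kept;
    }
}
```

Case (b) is the same call with a single-element list: the one qualifier is returned only if it is inside the bounds.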

          lhofhansl Lars Hofhansl added a comment -

          On the other hand API 1 maps better to setStartRow, setStopRow.
          +1 on everything else Stack said.

          Would scanning within a row imply start and end row set? I think it should. To the same row?
          (Or are we saying that, for all rows returned by the scan, we scan only within these column boundaries?)
          If so, maybe we can combine startRow and startQualifier, ditto for stopRow and endQualifier.

          Or we can start at RowX, QualifierA and scan to RowY, QualifierB? I.e. for all rows between RowX and RowY we'd return all column...

          stack stack added a comment -

          I like Lars's observation of how setStartQualifier maps to setStartRow; I'd say that would tip me in favor of api 1 (mapping setStartRow/setEndRow seems like a better fit than mapping time range).

          lhofhansl Lars Hofhansl added a comment -

          What about the other points? Are we seeing this as:
          1. A way to scan subset of many rows (would be inefficient as we need to reseek at the beginning of every new row)?
          or
          2. A scan strictly within a single row? (should enforce start/endRow set to the same value in this case.)
          or
          3. More control where to start and stop a scan? (leave the scan logic mostly the way it is, and allow a way to seek to a column of a row, and end at a different column in another row).

          #1 makes no sense (to me anyway, but maybe there's something I didn't see).
          #3 is a superset of #2. As described above, in this case we could start at RowX:QualA and scan to RowY:QualB, all rows between RowX and RowY would scan all Quals. Can be combined easily with timerange scans.

          I'd be in favor of #3. Or #2 if it turns out to be less work.
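
Option #3 can be pictured as a single range over (row, qualifier) coordinates sorted by row first and qualifier second. The sketch below is a toy model under that assumption. `CoordinateRangeSketch`, `Coord`, and `scan` are hypothetical names, strings stand in for `byte[]`, and the stop coordinate is treated as exclusive purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model of option #3: start at RowX:QualA, stop at RowY:QualB.
class CoordinateRangeSketch {

    // A (row, qualifier) coordinate ordered by row, then by qualifier.
    static final class Coord implements Comparable<Coord> {
        final String row;
        final String qualifier;

        Coord(String row, String qualifier) {
            this.row = row;
            this.qualifier = qualifier;
        }

        @Override
        public int compareTo(Coord other) {
            int byRow = row.compareTo(other.row);
            return byRow != 0 ? byRow : qualifier.compareTo(other.qualifier);
        }

        @Override
        public String toString() {
            return row + ":" + qualifier;
        }
    }

    // Returns every coordinate in [start, stop). Rows strictly between the
    // start row and the stop row contribute all of their qualifiers.
    static List<String> scan(TreeSet<Coord> table, Coord start, Coord stop) {
        List<String> seen = new ArrayList<>();
        for (Coord coord : table.subSet(start, true, stop, false)) {
            seen.add(coord.toString());
        }
        return seen;
    }
}
```

Setting the start and stop rows to the same value reduces this to option #2, which is why #3 is a superset of #2.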

          gqchen Jerry Chen added a comment -

          Should we consider implementing in filters? We already have a ColumnRangeFilter which does something similar to what we are discussing here.

          lhofhansl Lars Hofhansl added a comment -

          I think the idea is the speed improvement. It is already possible to specify which columns/versions you want in a scan, but the scan always has to start at the beginning of the row, which might be inefficient for very wide rows.

          gqchen Jerry Chen added a comment -

          For ColumnRangeFilter, I believe we seek to the beginning of the start column.
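
The difference between reading from the start of the row and seeking to the start column can be shown by counting how many cells each approach touches. This is only a model over a sorted list, not a measurement of HBase or of ColumnRangeFilter. `SeekVersusFilterSketch` and both method names are hypothetical, and the binary search merely stands in for a seek.

```java
import java.util.Collections;
import java.util.List;

// Toy cost model: how many cells are examined to serve the range [start, stop)?
class SeekVersusFilterSketch {

    // Reads from the first cell of the row. Every cell before `stop` is
    // examined, including those before the range that are then discarded.
    static int cellsExaminedFromRowStart(List<String> sortedQualifiers, String stop) {
        int examined = 0;
        for (String qualifier : sortedQualifiers) {
            if (qualifier.compareTo(stop) >= 0) {
                break; // past the range, stop reading
            }
            examined++;
        }
        return examined;
    }

    // Jumps to the first qualifier >= start, then reads until `stop`.
    static int cellsExaminedWithSeek(List<String> sortedQualifiers,
                                     String start, String stop) {
        int index = Collections.binarySearch(sortedQualifiers, start);
        if (index < 0) {
            index = -index - 1; // insertion point when start is not present
        }
        int examined = 0;
        for (int i = index; i < sortedQualifiers.size(); i++) {
            if (sortedQualifiers.get(i).compareTo(stop) >= 0) {
                break;
            }
            examined++;
        }
        return examined;
    }
}
```

In this model, for a row with 100 qualifiers `q00` to `q99` and the range from `q90` up to but excluding `q95`, the first method examines 95 cells and the second examines 5.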

          lhofhansl Lars Hofhansl added a comment -

          You are right (just looked at the code). Comes back to my question above then: what do we want to get out of this?
          Personally, we have a need here for finer control over exactly where to start a query and where to stop it.

          dave_revell Dave Revell added a comment -

          ColumnRangeFilter seems to make this ticket obsolete, at least for the use cases I had in mind when originally discussing this ticket. Thanks Lars and Jerry for noticing.

          So I'd be in favor of closing this ticket.

          stack stack added a comment -

          A bit of doc for the book

          stack stack added a comment -

          Resolving at Dave's suggestion. (Committed a bit of doc to the book, in the "How do I.." section, on how to do column slices in HBase.)

          lhofhansl Lars Hofhansl added a comment -

          I still think being able to start a scanner at a column would be a helpful addition. We had a use case for that with wide tables in one of our POCs, but then reworked the problem. Not sure how general this usecase would be, though.

          lhofhansl Lars Hofhansl added a comment -

          +1 on doc patch.

          stack stack added a comment -

          Sorry Lars. Missed your note. Should we open a more specific issue?

          lhofhansl Lars Hofhansl added a comment -

          Only if there's a strong need. I think starting a scanner at a column (i.e. seeking to a column in a row) should not be difficult. Stopping at a column might be a bit more tricky, as you'd need to know what to do with rows in between.

          Let's just let this rest

          hudson Hudson added a comment -

          Integrated in HBase-TRUNK #2460 (See https://builds.apache.org/job/HBase-TRUNK/2460/)
          HBASE-4256 Intra-row scanning (part deux)

          stack :
          Files :

          • /hbase/trunk/src/docbkx/book.xml

            People

            • Assignee: dave_revell Dave Revell
            • Reporter: jdcryans Jean-Daniel Cryans
            • Votes: 0
            • Watchers: 6