HBase / HBASE-52 [hbase] Add a means of scanning over all versions / HBASE-33

Add an HTable get/obtainScanner method that retrieves all versions of a particular column and row between two timestamps

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.0
    • Component/s: Client
    • Labels: None

      Description

      The use case:

      • A weblog application for which rows are user ids and posts are stored in a single column, with post date specified by the cell's timestamp. The application would then need to be able to display all posts for the last week or month.
      • A feedfetcher for which rows are URLs and feed posts are stored in a single column with the post publish date or fetch time stored in the cell's timestamp. The application would then need to be able to display all posts for the last week or month.

      Proposed API:

      // Get all versions of the specified row and column whose timestamps are in [minTimestamp, maxTimestamp]
      SortedMap<Long, byte[]> getTimestamps(Text row, Text column, long minTimestamp, long maxTimestamp);

      // Get all versions of the specified row and column whose timestamps are >= minTimestamp
      SortedMap<Long, byte[]> getTimestamps(Text row, Text column, long minTimestamp);
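The intended semantics of the two proposed methods can be sketched against an in-memory map. This is only an illustration, not HBase code: the class `VersionRangeSketch` and the `versions` parameter (standing in for all stored versions of one row and column) are hypothetical, and the timestamp bounds are treated as inclusive, as the comments above state.

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the versions of a single (row, column) pair.
class VersionRangeSketch {

    // Versions with minTimestamp <= ts <= maxTimestamp, ordered by ascending timestamp.
    // Note: subMap throws IllegalArgumentException if minTimestamp > maxTimestamp.
    static SortedMap<Long, byte[]> getTimestamps(NavigableMap<Long, byte[]> versions,
                                                 long minTimestamp, long maxTimestamp) {
        return new TreeMap<>(versions.subMap(minTimestamp, true, maxTimestamp, true));
    }

    // Versions with ts >= minTimestamp, ordered by ascending timestamp.
    static SortedMap<Long, byte[]> getTimestamps(NavigableMap<Long, byte[]> versions,
                                                 long minTimestamp) {
        return new TreeMap<>(versions.tailMap(minTimestamp, true));
    }

    public static void main(String[] args) {
        NavigableMap<Long, byte[]> posts = new TreeMap<>();
        posts.put(100L, new byte[] {1});
        posts.put(200L, new byte[] {2});
        posts.put(300L, new byte[] {3});
        // Print the timestamps selected by each variant.
        System.out.println(getTimestamps(posts, 100L, 200L).keySet());
        System.out.println(getTimestamps(posts, 200L).keySet());
    }
}
```

For the weblog use case, `minTimestamp` would be "now minus one week" and the returned map would hold one entry per post.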

      I'd be happy to take this on myself, as I need it for the above use cases before migrating my application over to HBase.

          Activity

          Jonathan Gray added a comment -

          Fixed as part of HBASE-1304 commit.

          Verified in passing unit test: org.apache.hadoop.hbase.client.TestClient.testJIRAs.jiraTest33()

          stack added a comment -

          TODO: Confirm actually solved by HBASE-1304

          Alex Newman added a comment (edited) -

          Sounds good. I have changed our storage strategy, so moving this to a later release is still fine.

          Jim Kellerman added a comment -

          Moving to 0.3.0 since it is such a significant server side change.

          stack added a comment -

          I chatted with the customer who wanted this issue the most – powerset pipeline – and the lads said that they want 0.2 out more than they want this feature and that they can live with this issue being punted to 0.3.

          Jean-Daniel Cryans added a comment -

          Very true Jim. +1

          Jim Kellerman added a comment -

          While the proposed solution introduces minimal (and upward compatible) changes to the API, the changes on the server side will be fairly extensive:

          • changing the internal scanner API
          • changes to HRegion.HScanner, HStoreScanner, Memcache$MemcacheScanner and StoreFileScanner
          • regression test changes

          Given that we are trying to stabilize release 0.2.0 so that we can release it before Hadoop 0.18.0 is available, is this issue critical for HBase 0.2.0 or can it be postponed to HBase 0.3.0?

          Jim Kellerman added a comment -

          Implementing a scanner that returns multiple Cells for a single column is going to force an API change. Google's API for scanners is somewhat different from HBase:

          Scanner scanner(T);
          ScanStream* stream;
          stream = scanner.FetchColumnFamily("anchor");
          stream->SetReturnAllVersions();
          scanner.Lookup("com.cnn.www");
          for (; !stream->Done(); stream->Next()) {
            printf("%s %s %lld %s\n",
              scanner.RowName(),
              stream->ColumnName(),
              stream->MicroTimestamp(),
              stream->Value());
          }
          

          In HBase, we currently cannot retrieve values for multiple timestamps for the same column:

          HTable t = new HTable(conf, "tableName");
          Scanner s = t.getScanner(columns, startRow, timestamp, filter);
          try {
            RowResult r = null;
            // Each RowResult holds at most one Cell per column.
            while ((r = s.next()) != null) {
              System.out.print(Bytes.toString(r.getRow()));
              for (Map.Entry<byte[], Cell> column: r.entrySet()) {
                System.out.print(" " + Bytes.toString(column.getKey()));
                Cell c = column.getValue();
                System.out.println(" " + c.getTimestamp() + " " + Bytes.toString(c.getValue()));
              }
            }
          } finally {
            s.close();
          }
          

          The problem is, how do we return multiple Cells per column, without seriously breaking the client API? Proposed solution:

          • make Cell implement Iterable
          • getValue() returns the "current" value
          • getTimestamp() returns the "current" timestamp
          • hasNext() returns true if there are more values
          • next() advances to the "next" value/timestamp

          Initially the "current" value/timestamp point to the first timestamp/value, which preserves the current API.
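A minimal sketch of the proposed behaviour follows, assuming the versions of one column are held in parallel arrays. The class `MultiVersionCellSketch` is hypothetical and is not the actual HBase `Cell` class; it only shows how a "current" position lets existing callers of `getValue()` and `getTimestamp()` keep working while new callers step through further versions.

```java
import java.util.NoSuchElementException;

// Hypothetical sketch of a cell that carries several versions of one column.
class MultiVersionCellSketch {
    private final long[] timestamps;
    private final byte[][] values;
    private int current = 0; // starts at the first version, matching the single-value API

    MultiVersionCellSketch(long[] timestamps, byte[][] values) {
        if (timestamps.length == 0 || timestamps.length != values.length) {
            throw new IllegalArgumentException("need at least one version and one value per timestamp");
        }
        this.timestamps = timestamps.clone();
        this.values = values.clone();
    }

    // Value of the "current" version.
    byte[] getValue() {
        return values[current];
    }

    // Timestamp of the "current" version.
    long getTimestamp() {
        return timestamps[current];
    }

    // True if there is at least one version after the current one.
    boolean hasNext() {
        return current + 1 < timestamps.length;
    }

    // Advances the "current" position to the next version.
    void next() {
        if (!hasNext()) {
            throw new NoSuchElementException("no more versions");
        }
        current++;
    }
}
```

A caller that only wants the first version ignores `hasNext()` and `next()`; a caller that wants every version reads the current value first and then loops while `hasNext()` returns true.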

          Comments?

          stack added a comment -

          Giving it back to Jim now he's back

          stack added a comment -

          Needed so J-D can finish up the region historian

          stack added a comment -

          Adding scanner within a timestamp range to the 0.2 list (needed internally).

          stack added a comment -

          So, sounds like you need to add another method to your wishlist John, an obtainScanner that takes upper and lower bounds on timestamp. Scanners already take filters. There is an example that takes a regex over rows. Filters can also work against columns. But scanners currently only return nearest to supplied timestamp; you need all nearest to a specified starting stamp and N before specified end timestamp.

          John Henton added a comment -

          @Bryan: that's what I suspected, however I couldn't discern how to constrain my rows by those created between min and max timestamp.

          Bryan Duxbury added a comment -

          @John: It sounds like what you want is more like a scanner than what this issue is looking for. You can already do what you're talking about if you use a scanner.

          John Henton added a comment -

          Is it possible to have the row and column parameters accept regex as well? For instance, in line with the user scenario above, I need to get all of the feed posts from pages within the domain "org\.apache\..*". And I'd also like to request the cells from a column family and not be restricted to a single column.
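To make the request concrete, here is a rough sketch of regex matching over row keys using plain `java.util.regex`. It is not an HBase filter: the class `RowRegexSketch` and the in-memory list of row keys are hypothetical, and the example keys assume rows are stored with the domain written in the `org.apache...` order used in the pattern above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: selects row keys that fully match a regular expression.
class RowRegexSketch {
    static List<String> matchingRows(List<String> rowKeys, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (String row : rowKeys) {
            // matches() requires the whole row key to match, not just a prefix.
            if (pattern.matcher(row).matches()) {
                matches.add(row);
            }
        }
        return matches;
    }
}
```

The escaped dots matter: with `org\.apache\..*` a key such as `orgXapacheXfoo` is rejected, whereas an unescaped `org.apache..*` would accept it.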

          stack added a comment -

          I was going to ask if the following works:

            public byte[][] get(Text row, Text column, long maxTimestamp, int numVersions)
          

          where numVersions is the number of items to display on a single page + 1 so you know if you should show the 'next' page button or not.

          But given that you'd have to look at the results, sort them, and chop the ones not wanted, that seems sufficient reason to add the methods you propose.

          +1

          (Note: you'll need to add the long type to HbaseMapWritable.)
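The "page size + 1" idea mentioned in this comment can be sketched independently of HBase. The class `PageSketch` and its method names are hypothetical; the assumption is that the caller requested `pageSize + 1` versions and passes in whatever came back.

```java
import java.util.List;

// Hypothetical helper for the "fetch one extra item" paging trick.
class PageSketch {

    // If more than pageSize items came back, at least one more page exists.
    static <T> boolean hasNextPage(List<T> fetched, int pageSize) {
        return fetched.size() > pageSize;
    }

    // The items to display: at most pageSize of the fetched items, in their original order.
    static <T> List<T> pageOf(List<T> fetched, int pageSize) {
        return fetched.subList(0, Math.min(pageSize, fetched.size()));
    }
}
```

The extra item is never shown; it only decides whether the 'next' button is rendered.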


            People

            • Assignee: Jonathan Gray
            • Reporter: Peter Dolan
            • Votes: 0
            • Watchers: 4
