HBase
  1. HBase
  2. HBASE-2426

[Transactional Contrib] Introduce quick scanning row-based secondary indexes

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Release Note:
      Hide
      RowBasedIndexSpecification introduces secondary indices where base table rows with the same indexed column value have their row keys stored as column qualifiers on the same secondary index table row. The key for that row is the indexed column value from the base table.
      Show
      RowBasedIndexSpecification introduces secondary indices where base table rows with the same indexed column value have their row keys stored as column qualifiers on the same secondary index table row. The key for that row is the indexed column value from the base table.

      Description

      RowBasedIndexSpecification is a specialized IndexSpecification class for creating row-based secondary index tables. Base table rows with the same indexed column value have their row keys stored as column qualifiers on the same secondary index table row. The key for that row is the indexed column value from the base table. This allows to avoid expensive secondary index table scans and provides faster access for applications such as foreign key indexing or queries such as "find all table A rows whose familyA:columnB value is X". RowBasedIndexSpecification indices can be scanned using the API on RowBasedIndexedTable. The metadata for RowBasedIndexSpecification differ from IndexSpecification in that:

      • Only a single base table column can be indexed per RowBasedIndexSpecification. No additional columns are put in the index table.
        and
      • RowBasedIndexKeyGenerator, which constructs the index-row-key from the indexed column value in the original column, is always used.

      For a simple RowBasedIndexSpecification example, look at the TestRowBasedIndexedTable unit test in org.apache.hadoop.hbase.client.tableIndexed.

      To enable RowBasedIndexSpecification indexing, modify hbase-site.xml to turn on the
      IndexedRegionServer. This is done by setting

      • hbase.regionserver.class to org.apache.hadoop.hbase.ipc.IndexedRegionInterface and
      • hbase.regionserver.impl to org.apache.hadoop.hbase.regionserver.tableindexed.RowBasedIndexedRegionServer

        Activity

        Hide
        George P. Stathis added a comment -

        Attaching branch patch for review. When cleared and finalized, I will port it to the maven trunk as well.

        Show
        George P. Stathis added a comment - Attaching branch patch for review. When cleared and finalized, I will port it to the maven trunk as well.
        Hide
        George P. Stathis added a comment -

        This patch builds right on top of HBASE-2286. So the latter is required (it's already committed so it should not be a problem as long as folks SVN up the latest from the branch).

        Show
        George P. Stathis added a comment - This patch builds right on top of HBASE-2286 . So the latter is required (it's already committed so it should not be a problem as long as folks SVN up the latest from the branch).
        Hide
        Clint Morgan added a comment -

        Hey George, thanks for the patch.

        I have a question about how this improves performance over an
        index layout similar to the SimpleIndexKeyGenerator. I have the same
        requirements you mention above: namely I'd like to quickly finda all
        rows in table A which have a value for COL1 of 'X'.

        I build my index keys like <col1-value><sep><base-row-id> where <sep>
        is a special byte sequence that does not occur in column values or row
        keys. (Actually it can occur, if so I just escape it in the
        index-row). Lets say <sep> is '__' in the example below

        So if I have base rows:
        ROW | COL_A
        aaa | foo
        bbb | bar
        ccc | foo
        ddd | zoo

        Then my index would look like (just the rows are shown):
        bar__bbb
        foo__aaa
        foo__ccc
        zoo__ddd

        So for the query find all rows where COL_A == foo, I do an index scan
        starting at 'foo_' and ending at 'foo*' (where * is the byte after
        '_').

        This will only scan through only the two index rows I wanted. Looks
        like your patch will make it so rather than scanning two rows with on
        cell each I scan one row with two cells each. I'm not 100% sure on the
        specifics, but I think these two queries would generally be of the
        same order of performance.

        Do I understand things correctly? Is there a reason you could not use
        the existing index mechanism for your needs?

        I think we could do some work to make this pattern more obvious and
        usable with the current infrastructure, but I'm a bit hesitant to add yet
        another region/regionserver extension.

        George, what do you think?

        Slightly aside: When I read about AppEngine's index (a year ago or so), they said that they maintain N index rows for a single base row (1 per column being indexed). I've been wanting to rework this framework to support that as well, but it has not been a high priority as it would require a rewrite of our query stuff that uses the current indexing layer. The approach you take is the opposite: 1 index row for for N base rows. Not sure that really says anything, but ...

        Show
        Clint Morgan added a comment - Hey George, thanks for the patch. I have a question about how this improves performance over an index layout similar to the SimpleIndexKeyGenerator. I have the same requirements you mention above: namely I'd like to quickly finda all rows in table A which have a value for COL1 of 'X'. I build my index keys like <col1-value><sep><base-row-id> where <sep> is a special byte sequence that does not occur in column values or row keys. (Actually it can occur, if so I just escape it in the index-row). Lets say <sep> is '__' in the example below So if I have base rows: ROW | COL_A aaa | foo bbb | bar ccc | foo ddd | zoo Then my index would look like (just the rows are shown): bar__bbb foo__aaa foo__ccc zoo__ddd So for the query find all rows where COL_A == foo, I do an index scan starting at 'foo_ ' and ending at 'foo *' (where * is the byte after '_'). This will only scan through only the two index rows I wanted. Looks like your patch will make it so rather than scanning two rows with on cell each I scan one row with two cells each. I'm not 100% sure on the specifics, but I think these two queries would generally be of the same order of performance. Do I understand things correctly? Is there a reason you could not use the existing index mechanism for your needs? I think we could do some work to make this pattern more obvious and usable with the current infrastructure, but I'm a bit hesitant to add yet another region/regionserver extension. George, what do you think? Slightly aside: When I read about AppEngine's index (a year ago or so), they said that they maintain N index rows for a single base row (1 per column being indexed). I've been wanting to rework this framework to support that as well, but it has not been a high priority as it would require a rewrite of our query stuff that uses the current indexing layer. The approach you take is the opposite: 1 index row for for N base rows. Not sure that really says anything, but ...
        Hide
        George P. Stathis added a comment -

        Clint,

        Thank you for the feedback. After more consideration during this past week and after reviewing your comment, I think that I should retract any speed improvement claims regarding this patch. I started on this contrib focusing solely on scans but I naively neglected to account for the column iterations. If one takes everything in to account, I do agree that there should not be much of a speed difference between what's already there and the RowBasedIndexSpecification. Maybe if one thinks in terms of IO, fetching a row and then iterating in memory instead of scanning through files might have an edge, but I'm not quite sure about this either; I'm still new with this technology stack and I'm not sure if scanning through more rows means going though more files. Some actual performance tests should be run to see if that statement even holds (or someone more knowledgeable like you should set me straight ).

        So, at the very least, the JavaDoc should be amended to reflect this.

        As it turns out though, this contrib is definitely useful when used in conjunction with https://issues.apache.org/jira/browse/HBASE-2438. Since there is currently no reliable way to paginate through rows, a row based indexing approach can at least guarantee that the pages returned contain the number of rows requested. Our application does leverage pagination, so we will be able to use this, at least until a reliable row-based pagination comes along. After that, it may be six and half a dozen. One thing that the new contrib does not offer over the current solution is the ability to store additional column values in the index for further filtering. This might be a deal-breaker for some folks.

        Let me know what you think. If people don't have any use for this except for column-based pagination, maybe it's not worth adding.

        Show
        George P. Stathis added a comment - Clint, Thank you for the feedback. After more consideration during this past week and after reviewing your comment, I think that I should retract any speed improvement claims regarding this patch. I started on this contrib focusing solely on scans but I naively neglected to account for the column iterations. If one takes everything in to account, I do agree that there should not be much of a speed difference between what's already there and the RowBasedIndexSpecification. Maybe if one thinks in terms of IO, fetching a row and then iterating in memory instead of scanning through files might have an edge, but I'm not quite sure about this either; I'm still new with this technology stack and I'm not sure if scanning through more rows means going though more files. Some actual performance tests should be run to see if that statement even holds (or someone more knowledgeable like you should set me straight ). So, at the very least, the JavaDoc should be amended to reflect this. As it turns out though, this contrib is definitely useful when used in conjunction with https://issues.apache.org/jira/browse/HBASE-2438 . Since there is currently no reliable way to paginate through rows, a row based indexing approach can at least guarantee that the pages returned contain the number of rows requested. Our application does leverage pagination, so we will be able to use this, at least until a reliable row-based pagination comes along. After that, it may be six and half a dozen. One thing that the new contrib does not offer over the current solution is the ability to store additional column values in the index for further filtering. This might be a deal-breaker for some folks. Let me know what you think. If people don't have any use for this except for column-based pagination, maybe it's not worth adding.
        Hide
        stack added a comment -

        Bulk move of 0.20.5 issues into 0.21.0 after vote that we merge branch into TRUNK up on list.

        Show
        stack added a comment - Bulk move of 0.20.5 issues into 0.21.0 after vote that we merge branch into TRUNK up on list.
        Hide
        stack added a comment -

        Moved from 0.21 to 0.22 just after merge of old 0.20 branch into TRUNK.

        Show
        stack added a comment - Moved from 0.21 to 0.22 just after merge of old 0.20 branch into TRUNK.
        Hide
        stack added a comment -

        Transaction is no longer part of hbase so this issue no longer pertains (Reopen if I have it wrong George)

        Show
        stack added a comment - Transaction is no longer part of hbase so this issue no longer pertains (Reopen if I have it wrong George)
        Hide
        Gary Helmling added a comment -

        Dropping fix version. It's confusing to have marked against 0.92 when it's closed because transactional contrib was moved out.

        Show
        Gary Helmling added a comment - Dropping fix version. It's confusing to have marked against 0.92 when it's closed because transactional contrib was moved out.

          People

          • Assignee:
            Unassigned
            Reporter:
            George P. Stathis
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development