Uploaded image for project: 'Phoenix'
  1. Phoenix
  2. PHOENIX-427

Allow creating VIEWs on the same physical HBase table

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 4.0.0
    • Fix Version/s: 3.0.0, 4.0.0
    • Labels:
      None
    • old issue number:
      296

      Description

      A single HBase table can comfortably grow to any size, and there are good reasons to have a relatively small number of them on a cluster (< thousands). Additionally, the fact that HBase can have an arbitrary set of column qualifiers in every row means that one physical table need not be limited to a single schema with a fixed set of columns. The 1-to-1 correspondance of schema <-> table in Phoenix is somewhat artificially limiting; it'd be nice if apps could share large physical tables across many logical phoenix schemas. (This is especially common in multi-tenant situations, which Phoenix already supports by virtue of accepting a tenant-id in the connection properties).

      I'd like to suggest giving Pheonix the ability to run one or more views (or tenant-specific tables) that actually use a single common physical HBase table. So when that tenant issues the query "SELECT * FROM foo_view", Phoenix translates this to really mean "SELECT c1, c2, c3 from foo_table".

      The structure of the primary key (row key) of all the views would probably need to the same. I can think of a couple ways to structure this from a syntax point of view:

      • Add a way of specifying the backing HBase table, either adding a WITH clause or a TABLESPACE clause in the CREATE TABLE statement, along the lines of what Postgres allows: http://www.postgresql.org/docs/9.1/static/sql-createtable.html
      • A tenant could have the ability to only alter an existing table, not create a new one. The base table would need to be created beforehand on a connection that doesn't specify a tenant id. This would define the row key columns and any base key value columns. Tenants would then be able to add/remove their own key value columns, and Phoenix would take care of combining the "base" columns with the "tenant" columns based on the tenantId specified at connection time.

      We'd also have to decide whether to automatically enforce row ownership by virtual table (by, for example, transparently including the tenant id in the rowkew) or just leave that up to the clients. (It might be weird if you get a bunch of rows back from your query that someone else inserted into another logical table, with all nulls for the columns you're expecting ...). This could also be done by including a hidden column in the row, or a bit in the rowkey, that indicates which virtual table the rows are part of.

        Activity

        Hide
        pctony Tony Stevenson added a comment -

        Comment:ryang-sfdc:06/26/13 05:17:09 PM:

        +1

        Show
        pctony Tony Stevenson added a comment - Comment:ryang-sfdc:06/26/13 05:17:09 PM: +1
        Hide
        pctony Tony Stevenson added a comment -

        Comment:apurtell:06/27/13 04:29:09 AM:

        > The 1-to-1 correspondance of schema <-> table in Phoenix is somewhat artificially limiting; it'd be nice if apps could share large physical tables across many logical phoenix schemas. (This is especially common in multi-tenant situations, which Phoenix already supports by virtue of accepting a tenant-id in the connection properties).

        This is a nice idea. If tenants can be segregated to single physical tables, it also opens up possibilities for pinning those tables to regionserver groups HBASE-6721(https://issues.apache.org/jira/browse/HBASE-6721), a feature I expect to see in an upcoming 0.94.

        Show
        pctony Tony Stevenson added a comment - Comment:apurtell:06/27/13 04:29:09 AM: > The 1-to-1 correspondance of schema <-> table in Phoenix is somewhat artificially limiting; it'd be nice if apps could share large physical tables across many logical phoenix schemas. (This is especially common in multi-tenant situations, which Phoenix already supports by virtue of accepting a tenant-id in the connection properties). This is a nice idea. If tenants can be segregated to single physical tables, it also opens up possibilities for pinning those tables to regionserver groups HBASE-6721 ( https://issues.apache.org/jira/browse/HBASE-6721 ), a feature I expect to see in an upcoming 0.94.
        Hide
        pctony Tony Stevenson added a comment -

        Comment:ivarley:06/27/13 04:33:36 AM:

        Ooh, nice suggestion Andrew. +1

        Show
        pctony Tony Stevenson added a comment - Comment:ivarley:06/27/13 04:33:36 AM: Ooh, nice suggestion Andrew. +1
        Hide
        pctony Tony Stevenson added a comment -

        Comment:elevine:07/31/13 03:03:35 PM:

        Unless there are objections, I'm planning on taking a stab at this in the next couple of days. @jamestaylor, any suggestions on implementation approaches? Thanks.

        Show
        pctony Tony Stevenson added a comment - Comment:elevine:07/31/13 03:03:35 PM: Unless there are objections, I'm planning on taking a stab at this in the next couple of days. @jamestaylor, any suggestions on implementation approaches? Thanks.
        Hide
        pctony Tony Stevenson added a comment -

        Comment:jamestaylor:07/31/13 03:03:35 PM:

        mentioned

        Show
        pctony Tony Stevenson added a comment - Comment:jamestaylor:07/31/13 03:03:35 PM: mentioned
        Hide
        pctony Tony Stevenson added a comment -

        Comment:jamestaylor:08/01/13 02:03:56 AM:

        Here's my first cut on how this could be implemented:

        • Add TENANT_ID as the leading part of the SYSTEM.TABLE row key. It could be defined as a nullable VARCHAR.
        • Create a conversion script that inserts a null byte to the row key of every row of the SYSTEM.TABLE. This would define the "base table".
        • Modify MetaDataEndPointImpl code to use the TenantId connection property to scan SYSTEM.TABLE. You'll need to pass this through as a new/additional argument to the endpoint coprocessor methods.
        • Add an HBASE_TABLE_NAME VARCHAR key value column to SYSTEM.TABLE. This would be null for the "base table" and populated for "derived" tables.
        • Optionally add a TABLE_ID key value column to SYSTEM.TABLE. This would be null for the "base table" and populated for "derived" tables with the "key prefix". If we're ok including the "logical" schema name plus table name in the row key of the physical table, we don't necessarily need this. But this would be the "condensed version" of this - we might even just generate it, given that we'll have sequence support shortly ( -18 ).
        • Make the above mentioned grammar changes to be able to define your "physical" HBase table in a CREATE TABLE statement. It could be as simple as a new property (like we've done with salting and immutable tables), then no grammar changes would be necessary.
        • Persist this data into SYSTEM.TABLE by passing it through the UPSERT command.
        • Cache this data in the PTable returned by the MetaDataEndPointImpl methods.
        • Modify MetaDataClient.createTable to check for this new property and enforce that no primary key is defined for these new types of derived tables. Note that we might be able to allow additional nullable row key columns to be added. We should check here that the base table row key conforms to what we need:
        • declares a tenant id column (as mentioned in -356) and a logical table id column (same idea as with tenant id so we don't have "hard coded" column names)
        • leads the row key with the tenant id column followed by logical table id column
        • Modify WhereOptimizer.pushKeyExpressionsToScan to automatically insert AND conditions for tenant id (based on the TenantId connection property plus the column name identified by -356) and logical table id (based on the table id that would be cached in the PTable plus again another column name defined at DDL time). This would be the "runtime" glue that'll cause your queries to run against the right set of data.
        Show
        pctony Tony Stevenson added a comment - Comment:jamestaylor:08/01/13 02:03:56 AM: Here's my first cut on how this could be implemented: Add TENANT_ID as the leading part of the SYSTEM.TABLE row key. It could be defined as a nullable VARCHAR. Create a conversion script that inserts a null byte to the row key of every row of the SYSTEM.TABLE. This would define the "base table". Modify MetaDataEndPointImpl code to use the TenantId connection property to scan SYSTEM.TABLE. You'll need to pass this through as a new/additional argument to the endpoint coprocessor methods. Add an HBASE_TABLE_NAME VARCHAR key value column to SYSTEM.TABLE. This would be null for the "base table" and populated for "derived" tables. Optionally add a TABLE_ID key value column to SYSTEM.TABLE. This would be null for the "base table" and populated for "derived" tables with the "key prefix". If we're ok including the "logical" schema name plus table name in the row key of the physical table, we don't necessarily need this. But this would be the "condensed version" of this - we might even just generate it, given that we'll have sequence support shortly ( -18 ). Make the above mentioned grammar changes to be able to define your "physical" HBase table in a CREATE TABLE statement. It could be as simple as a new property (like we've done with salting and immutable tables), then no grammar changes would be necessary. Persist this data into SYSTEM.TABLE by passing it through the UPSERT command. Cache this data in the PTable returned by the MetaDataEndPointImpl methods. Modify MetaDataClient.createTable to check for this new property and enforce that no primary key is defined for these new types of derived tables. Note that we might be able to allow additional nullable row key columns to be added. We should check here that the base table row key conforms to what we need: declares a tenant id column (as mentioned in -356) and a logical table id column (same idea as with tenant id so we don't have "hard coded" column names) leads the row key with the tenant id column followed by logical table id column Modify WhereOptimizer.pushKeyExpressionsToScan to automatically insert AND conditions for tenant id (based on the TenantId connection property plus the column name identified by -356) and logical table id (based on the table id that would be cached in the PTable plus again another column name defined at DDL time). This would be the "runtime" glue that'll cause your queries to run against the right set of data.
        Hide
        pctony Tony Stevenson added a comment -

        Comment:elevine:08/01/13 02:04:09 AM:

        assigned

        Show
        pctony Tony Stevenson added a comment - Comment:elevine:08/01/13 02:04:09 AM: assigned
        Hide
        pctony Tony Stevenson added a comment -

        Comment:elevine:08/05/13 10:21:28 PM:

        Tenant-specific indexes built on top of these "views" is the next logical step.

        Show
        pctony Tony Stevenson added a comment - Comment:elevine:08/05/13 10:21:28 PM: Tenant-specific indexes built on top of these "views" is the next logical step.
        Hide
        pctony Tony Stevenson added a comment -

        Comment:elevine:08/06/13 03:31:45 AM:

        @jamestaylor, how about using the same table name for both regular and tenant-specific tables? The presence/absence of leading TENANT_ID row key part in SYSTEM.TABLE would distinguish one from another. That way we can omit the extra HBASE_TABLE_NAME column and assume the name of tenant-specific table would be shared by its base table (which would have no TENANT_ID).

        Show
        pctony Tony Stevenson added a comment - Comment:elevine:08/06/13 03:31:45 AM: @jamestaylor, how about using the same table name for both regular and tenant-specific tables? The presence/absence of leading TENANT_ID row key part in SYSTEM.TABLE would distinguish one from another. That way we can omit the extra HBASE_TABLE_NAME column and assume the name of tenant-specific table would be shared by its base table (which would have no TENANT_ID).
        Hide
        pctony Tony Stevenson added a comment -

        Comment:jamestaylor:08/06/13 03:31:45 AM:

        mentioned

        Show
        pctony Tony Stevenson added a comment - Comment:jamestaylor:08/06/13 03:31:45 AM: mentioned
        Hide
        pctony Tony Stevenson added a comment -

        Comment:elevine:08/06/13 04:25:47 AM:

        Err... scratch that last comment. We have to have the extra column pointing back to the base table to support multiple "views" on the same base table per tenant. Thanks @ivarley for pointing it out.

        Show
        pctony Tony Stevenson added a comment - Comment:elevine:08/06/13 04:25:47 AM: Err... scratch that last comment. We have to have the extra column pointing back to the base table to support multiple "views" on the same base table per tenant. Thanks @ivarley for pointing it out.
        Hide
        pctony Tony Stevenson added a comment -

        Comment:ivarley:08/06/13 04:25:47 AM:

        mentioned

        Show
        pctony Tony Stevenson added a comment - Comment:ivarley:08/06/13 04:25:47 AM: mentioned
        Hide
        pctony Tony Stevenson added a comment -

        Comment:jamestaylor:11/27/13 01:09:43 AM:

        Implemented by @elevine and pulled into master. Fantastic job! I'll close this issue, as the bulk of the functionality is in and we have these more specific issues for follow up work:

        • Tweak syntax for specifying base tenant table ( -592 )
        • Tenant data isolation for DML and SELECT ( -356 )
        • Hide tenant id column in tenant-specific tables ( -490 )
        • Base tenant table PK must lead with tenant_id and tenant_type_id columns ( -452 )
        • Support secondary indexes on tenant-specific tables ( -491 )
        • Changes metadata caches to be LRU ( -493 )
        • Performance test multi-tenant implementation ( -508 )
        Show
        pctony Tony Stevenson added a comment - Comment:jamestaylor:11/27/13 01:09:43 AM: Implemented by @elevine and pulled into master. Fantastic job! I'll close this issue, as the bulk of the functionality is in and we have these more specific issues for follow up work: Tweak syntax for specifying base tenant table ( -592 ) Tenant data isolation for DML and SELECT ( -356 ) Hide tenant id column in tenant-specific tables ( -490 ) Base tenant table PK must lead with tenant_id and tenant_type_id columns ( -452 ) Support secondary indexes on tenant-specific tables ( -491 ) Changes metadata caches to be LRU ( -493 ) Performance test multi-tenant implementation ( -508 )
        Hide
        pctony Tony Stevenson added a comment -

        Comment:elevine:11/27/13 01:09:43 AM:

        mentioned

        Show
        pctony Tony Stevenson added a comment - Comment:elevine:11/27/13 01:09:43 AM: mentioned
        Hide
        pctony Tony Stevenson added a comment -

        Comment:jamestaylor:11/27/13 01:09:43 AM:

        closed

        Show
        pctony Tony Stevenson added a comment - Comment:jamestaylor:11/27/13 01:09:43 AM: closed
        Hide
        gabriel.reid Gabriel Reid added a comment -

        Bulk resolve of closed issues imported from GitHub. This status was reached by first re-opening all closed imported issues and then resolving them in bulk.

        Show
        gabriel.reid Gabriel Reid added a comment - Bulk resolve of closed issues imported from GitHub. This status was reached by first re-opening all closed imported issues and then resolving them in bulk.
        Hide
        enis Enis Soztutar added a comment -

        Bulk close of all issues that has been resolved in a released version.

        Show
        enis Enis Soztutar added a comment - Bulk close of all issues that has been resolved in a released version.

          People

          • Assignee:
            elevine elevine
            Reporter:
            ivarley Ian Varley
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development