A single HBase table can comfortably grow to any size, and there are good reasons to have a relatively small number of them on a cluster (< thousands). Additionally, the fact that HBase can have an arbitrary set of column qualifiers in every row means that one physical table need not be limited to a single schema with a fixed set of columns. The 1-to-1 correspondance of schema <-> table in Phoenix is somewhat artificially limiting; it'd be nice if apps could share large physical tables across many logical phoenix schemas. (This is especially common in multi-tenant situations, which Phoenix already supports by virtue of accepting a tenant-id in the connection properties).
I'd like to suggest giving Pheonix the ability to run one or more views (or tenant-specific tables) that actually use a single common physical HBase table. So when that tenant issues the query "SELECT * FROM foo_view", Phoenix translates this to really mean "SELECT c1, c2, c3 from foo_table".
The structure of the primary key (row key) of all the views would probably need to the same. I can think of a couple ways to structure this from a syntax point of view:
- Add a way of specifying the backing HBase table, either adding a WITH clause or a TABLESPACE clause in the CREATE TABLE statement, along the lines of what Postgres allows: http://www.postgresql.org/docs/9.1/static/sql-createtable.html
- A tenant could have the ability to only alter an existing table, not create a new one. The base table would need to be created beforehand on a connection that doesn't specify a tenant id. This would define the row key columns and any base key value columns. Tenants would then be able to add/remove their own key value columns, and Phoenix would take care of combining the "base" columns with the "tenant" columns based on the tenantId specified at connection time.
We'd also have to decide whether to automatically enforce row ownership by virtual table (by, for example, transparently including the tenant id in the rowkew) or just leave that up to the clients. (It might be weird if you get a bunch of rows back from your query that someone else inserted into another logical table, with all nulls for the columns you're expecting ...). This could also be done by including a hidden column in the row, or a bit in the rowkey, that indicates which virtual table the rows are part of.