Index: src/docbkx/book.xml =================================================================== --- src/docbkx/book.xml (revision 1147401) +++ src/docbkx/book.xml (working copy) @@ -261,6 +261,70 @@ that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily. +
+ + Secondary Indexes and Alternate Query Paths + + This section could also be titled "what if my table rowkey looks like this but I also want to query my table like that." + A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are reporting requirements on activity across users for certain + time ranges. Thus, selecting by user is easy because it is in the lead position of the key, but time is not. + + There is no single answer on the best way to handle this because it depends on... + + Number of users + Data size and data arrival rate + Flexibility of reporting requirements (e.g., completely ad-hoc date selection vs. pre-configured ranges) + Desired execution speed of query (e.g., 90 seconds may be reasonable to some for an ad-hoc report, whereas it may be too long for others) + + ... and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution. + Common techniques are in sub-sections below. This is a comprehensive, but not exhaustive, list of approaches. + + It should not be a surprise that secondary indexes require additional cluster space and processing. + This is precisely what happens in an RDBMS because the act of creating an alternate index requires both space and processing cycles to update. RDBMS products + are more advanced in this regard to handle alternative index management out of the box. However, HBase scales better at larger data volumes, so this is a feature trade-off. + + Pay attention to when implementing any of these approaches. + Additionally, see the David Butler response in this dist-list thread HBase, mail # user - Stargate+hbase + +
+ + Filter Query + + Depending on the case, it may be appropriate to use . In this case, no secondary index is created. + +
+
+ + Periodic-Update Secondary Index + + A secondary index could be created in another table which is periodically updated via a MapReduce job. The job could be executed intra-day, but depending on + load-strategy it could still potentially be out of sync with the main data table. + See for more information. +
+
+ + Dual-Write Secondary Index + + Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to data table, write to index table). + If this approach is taken after a data table already exists, then bootstrapping will be needed for the secondary index with a MapReduce job (see ). + There are a variety of middleware frameworks that could be employed for fire-and-forget processing, such as ActiveMQ. +
+
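The dual-write idea can be illustrated with a small in-memory sketch. This is not HBase client code: two `TreeMap`s stand in for the data table and the index table, relying only on the fact that both keep keys sorted the way HBase keeps rowkeys sorted. The class name, method names, the zero-padded timestamp encoding, and the "timestamp-user" index key layout are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: sorted maps stand in for two HBase tables.
class DualWriteIndexSketch {
    // Data table: rowkey "user-timestamp" -> value.
    private final TreeMap<String, String> dataTable = new TreeMap<>();
    // Index table: rowkey "timestamp-user" -> data table rowkey.
    private final TreeMap<String, String> indexTable = new TreeMap<>();

    // Zero-pad so that string order matches numeric order (assumes ts >= 0).
    static String pad(long ts) {
        return String.format("%013d", ts);
    }

    // Dual write: every put to the data table is paired with a put to the index table.
    void put(String user, long ts, String value) {
        String dataKey = user + "-" + pad(ts);
        dataTable.put(dataKey, value);
        indexTable.put(pad(ts) + "-" + user, dataKey);
    }

    // Easy path: user is in the lead position, so this is a prefix range scan.
    // '.' is the character right after '-', so [user-, user.) covers the prefix.
    // Assumes user names do not themselves contain '-'.
    List<String> scanByUser(String user) {
        return new ArrayList<>(dataTable.subMap(user + "-", user + ".").values());
    }

    // Alternate path: range scan on the index (start inclusive, end exclusive),
    // then a lookup in the data table for each hit.
    List<String> scanByTime(long start, long end) {
        List<String> out = new ArrayList<>();
        for (String dataKey : indexTable.subMap(pad(start), pad(end)).values()) {
            out.add(dataTable.get(dataKey));
        }
        return out;
    }

    public static void main(String[] args) {
        DualWriteIndexSketch t = new DualWriteIndexSketch();
        t.put("alice", 100L, "login");
        t.put("bob", 150L, "view");
        t.put("alice", 300L, "logout");
        System.out.println(t.scanByUser("alice"));
        System.out.println(t.scanByTime(100L, 200L));
    }
}
```

In a real cluster the two writes are not atomic, so the index can briefly disagree with the data table; the sketch ignores that concern.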
+ + Coprocessor Secondary Index + + Coprocessors act like RDBMS triggers. These are currently on TRUNK. + +
+
+ + Summary Tables + + Where time-ranges are very wide (e.g., year-long report) and where the data is voluminous, summary tables are a common approach. + These would be generated with MapReduce jobs into another table. + See for more information. +
+
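As a rough illustration of the summary-table approach, the sketch below rolls raw event timestamps up into per-day counts and then answers a wide date-range query from the rollup alone. It is plain Java with no HBase or MapReduce dependency; the class name, the day-level granularity, and the epoch-millisecond timestamps are assumptions, and a real job would read from the data table and write the counts to a separate table.

```java
import java.util.TreeMap;

// Hypothetical sketch of a batch rollup into a "summary table".
class SummaryTableSketch {
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // Batch step: aggregate raw event timestamps (epoch millis) into day -> count.
    static TreeMap<Long, Long> rollupByDay(long[] eventTimestamps) {
        TreeMap<Long, Long> summary = new TreeMap<>();
        for (long ts : eventTimestamps) {
            long day = Math.floorDiv(ts, DAY_MS);
            summary.merge(day, 1L, Long::sum);
        }
        return summary;
    }

    // Query step: sum the precomputed counts for days in [startDay, endDay).
    // The cost depends on the number of days, not the number of raw events.
    static long countInDays(TreeMap<Long, Long> summary, long startDay, long endDay) {
        long total = 0;
        for (long count : summary.subMap(startDay, endDay).values()) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        long[] events = {0L, 1000L, DAY_MS, DAY_MS + 5, 3 * DAY_MS};
        TreeMap<Long, Long> summary = rollupByDay(events);
        System.out.println(summary);
        System.out.println(countInDays(summary, 0, 2));
    }
}
```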
@@ -1576,8 +1640,7 @@ - For a useful introduction to the issues involved maintaining a secondary Index in a store like HBase, - see the David Butler message in this thread, HBase, mail # user - Stargate+hbase + See