Index: src/docbkx/book.xml
===================================================================
--- src/docbkx/book.xml (revision 1147401)
+++ src/docbkx/book.xml (working copy)
@@ -261,6 +261,70 @@
that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.
+
+
+ Secondary Indexes and Alternate Query Paths
+
+ This section could also be titled "what if my table rowkey looks like this but I also want to query my table like that."
+ A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are reporting requirements on activity across users for certain
+ time ranges. Thus, selecting by user is easy because the user is in the lead position of the key, but selecting by time is not.
+
+ There is no single answer on the best way to handle this because it depends on...
+
+ Number of users
+ Data size and data arrival rate
+ Flexibility of reporting requirements (e.g., completely ad-hoc date selection vs. pre-configured ranges)
+ Desired execution speed of query (e.g., 90 seconds may be reasonable to some for an ad-hoc report, whereas it may be too long for others)
+
+ ... and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution.
+ Common techniques are in sub-sections below. This covers the most common approaches but is not an exhaustive list.
+
+ It should not be a surprise that secondary indexes require additional cluster space and processing.
+ This is precisely what happens in an RDBMS because the act of creating an alternate index requires both space and processing cycles to update. RDBMS products
+ are more advanced in this regard and handle alternative index management out of the box. However, HBase scales better at larger data volumes, so this is a feature trade-off.
+
+ Pay attention to when implementing any of these approaches.
+ Additionally, see the David Butler response in this dist-list thread HBase, mail # user - Stargate+hbase
+
+
+
+ Filter Query
+
+ Depending on the case, it may be appropriate to use . In this case, no secondary index is created.
+
+
+
+
+ Periodic-Update Secondary Index
+
+ A secondary index could be created in another table which is periodically updated via a MapReduce job. The job could be executed intra-day, but depending on
+ load-strategy it could still potentially be out of sync with the main data table.
+ See for more information.
+
+
+
+ Dual-Write Secondary Index
+
+ Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to data table, write to index table).
+ If this approach is taken after a data table already exists, then bootstrapping will be needed for the secondary index with a MapReduce job (see ).
+ There are a variety of middleware frameworks that could be employed for fire-and-forget processing, such as ActiveMQ.
+
+
+
+ Coprocessor Secondary Index
+
+ Coprocessors act like RDBMS triggers. These are currently on TRUNK.
+
+
+
+
+ Summary Tables
+
+ Where time-ranges are very wide (e.g., year-long report) and where the data is voluminous, summary tables are a common approach.
+ These would be generated with MapReduce jobs into another table.
+ See for more information.
+
+
@@ -1576,8 +1640,7 @@
- For a useful introduction to the issues involved maintaining a secondary Index in a store like HBase,
- see the David Butler message in this thread, HBase, mail # user - Stargate+hbase
+ See