Index: hbase-assembly/src/docbkx/schema_design.xml
===================================================================
--- hbase-assembly/src/docbkx/schema_design.xml (revision 1464053)
+++ hbase-assembly/src/docbkx/schema_design.xml (working copy)
@@ -438,7 +438,7 @@
Log Data / Timeseries Data
Log Data / Timeseries on Steroids
- Customer/Sales
+ Customer/Order
Tall/Wide/Middle Schema Design
List Data
@@ -527,7 +527,7 @@
-
+ Case Study - Log Data and Timeseries Data on Steroids
This effectively is the OpenTSDB approach. What OpenTSDB does is re-write data and pack rows into columns for
certain time-periods. For a detailed explanation, see: http://opentsdb.net/schema.html,
@@ -549,10 +549,10 @@
-
- Case Study - Customer / Sales
- Assume that HBase is used to store customer and sales information. There are two core record-types being ingested:
- a Customer record type, and Sales record type.
+
+ Case Study - Customer/Order
+ Assume that HBase is used to store customer and order information. There are two core record-types being ingested:
+ a Customer record type, and Order record type.
The Customer record type would include all the things that you’d typically expect:
@@ -562,21 +562,21 @@
Phone numbers, etc.
- The Sales record type would include things like:
+ The Order record type would include things like:
Customer number
- Sales/order number
+ Order number
Sales date
- A series of nested objects for shipping locations and line-items (this itself is a design case study)
+ A series of nested objects for shipping locations and line-items (see
+ for details)
Assuming that the combination of customer number and sales order uniquely identify an order, these two attributes will compose
the rowkey, and specifically a composite key such as:
- [customer number][sales number]
+ [customer number][order number]
-
-… for a SALES table. However, there are more design decisions to make: are the raw values the best choices for rowkeys?
+ … for an ORDER table. However, there are more design decisions to make: are the raw values the best choices for rowkeys?
The same design questions in the Log Data use-case confront us here. What is the keyspace of the customer number, and what is the
format (e.g., numeric? alphanumeric?) As it is advantageous to use fixed-length keys in HBase, as well as keys that can support a
@@ -585,16 +585,16 @@
Composite Rowkey With Hashes:
[MD5 of customer number] = 16 bytes
- [MD5 of sales number] = 16 bytes
+ [MD5 of order number] = 16 bytes
Composite Numeric/Hash Combo Rowkey:
[substituted long for customer number] = 8 bytes
- [MD5 of sales number] = 16 bytes
+ [MD5 of order number] = 16 bytes
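The two rowkey layouts above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the "substituted long" is assumed to be a non-negative integer obtained from some lookup that is not shown, and the input values are assumed to be strings.

```python
import hashlib
import struct


def md5_bytes(value: str) -> bytes:
    """Return the 16-byte MD5 digest of a UTF-8 encoded string."""
    return hashlib.md5(value.encode("utf-8")).digest()


def hashed_rowkey(customer_number: str, order_number: str) -> bytes:
    """[MD5 of customer number][MD5 of order number] = 32 bytes."""
    return md5_bytes(customer_number) + md5_bytes(order_number)


def numeric_hash_rowkey(customer_id: int, order_number: str) -> bytes:
    """[substituted long for customer number][MD5 of order number] = 24 bytes.

    customer_id is a hypothetical substituted long (assumed non-negative).
    """
    # ">q" packs a signed 64-bit big-endian integer. For non-negative ids,
    # big-endian encoding keeps the byte order consistent with numeric order.
    return struct.pack(">q", customer_id) + md5_bytes(order_number)
```

Both layouts are fixed-length, and all orders for one customer share the same leading bytes, so they sort next to each other.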
-
+ Single Table? Multiple Tables?
A traditional design approach would have separate tables for CUSTOMER and SALES. Another option is to pack multiple
record types into a single table (e.g., CUSTOMER++).
@@ -605,11 +605,11 @@
[type] = type indicating ‘1’ for customer record type
- Sales Record Type Rowkey:
+ Order Record Type Rowkey:
[customer-id]
- [type] = type indicating ‘2’ for sales record type
- [sales-order]
+ [type] = type indicating ‘2’ for order record type
+ [order]
The advantage of this particular CUSTOMER++ approach is that organizes many different record-types by customer-id
@@ -617,7 +617,121 @@
a particular record-type.
-
+
+ Order Object Design
+ Now we need to address how to model the Order object. Assume that the class structure is as follows:
+
+Order
+ ShippingLocation (an Order can have multiple ShippingLocations)
+ LineItem (a ShippingLocation can have multiple LineItems)
+
+ ... there are multiple options on storing this data.
+
+
+ Completely Normalized
+ With this approach, there would be separate tables for ORDER, SHIPPING_LOCATION, and LINE_ITEM.
+
+ The ORDER table's rowkey was described above:
+
+ The SHIPPING_LOCATION's composite rowkey would be something like this:
+
+ [order-rowkey]
+ [shipping location number] (e.g., 1st location, 2nd, etc.)
+
+
+ The LINE_ITEM table's composite rowkey would be something like this:
+
+ [order-rowkey]
+ [shipping location number] (e.g., 1st location, 2nd, etc.)
+ [line item number] (e.g., 1st lineitem, 2nd, etc.)
+
+
+ Such a normalized model is likely to be the approach with an RDBMS, but that's not your only option with HBase.
+ The con of such an approach is that to retrieve information about any Order, you will need:
+
+ Get on the ORDER table for the Order
+ Scan on the SHIPPING_LOCATION table for that order to get the ShippingLocation instances
+ Scan on the LINE_ITEM for each ShippingLocation
+
+ ... granted, this is what an RDBMS would do under the covers anyway, but since there are no joins in HBase
+ you're just more aware of this fact.
+
+
+
+ Single Table With Record Types
+ With this approach, there would exist a single table ORDER that would contain all three record types: Order, ShippingLocation, and LineItem.
+
+ The Order rowkey was described above:
+
+ [order-rowkey]
+ [ORDER record type]
+
+
+ The ShippingLocation composite rowkey would be something like this:
+
+ [order-rowkey]
+ [SHIPPING record type]
+ [shipping location number] (e.g., 1st location, 2nd, etc.)
+
+
+ The LineItem composite rowkey would be something like this:
+
+ [order-rowkey]
+ [LINE record type]
+ [shipping location number] (e.g., 1st location, 2nd, etc.)
+ [line item number] (e.g., 1st lineitem, 2nd, etc.)
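To see how these three rowkeys interact, the following Python sketch builds them and simulates a prefix scan over byte-sorted keys. It makes several assumptions that are not in the text above: the one-byte record type codes, the 2-byte big-endian encoding of the location and line item numbers, and a fixed-length order rowkey (so that one order's key is never a prefix of another's). All names are hypothetical.

```python
import struct

# Hypothetical one-byte record type codes; the actual values are a design choice.
ORDER, SHIPPING, LINE = b"\x01", b"\x02", b"\x03"


def order_key(order_rowkey: bytes) -> bytes:
    """[order-rowkey][ORDER record type]"""
    return order_rowkey + ORDER


def shipping_key(order_rowkey: bytes, ship_no: int) -> bytes:
    """[order-rowkey][SHIPPING record type][shipping location number]"""
    return order_rowkey + SHIPPING + struct.pack(">H", ship_no)


def line_key(order_rowkey: bytes, ship_no: int, line_no: int) -> bytes:
    """[order-rowkey][LINE record type][shipping location number][line item number]"""
    return order_rowkey + LINE + struct.pack(">HH", ship_no, line_no)


def prefix_scan(rows: dict, prefix: bytes) -> list:
    """Simulate a prefix scan: matching keys in lexicographic byte order."""
    return [key for key in sorted(rows) if key.startswith(prefix)]
```

With these codes, a scan on the order rowkey prefix returns the Order row first, then every ShippingLocation row, then every LineItem row. Because the record type precedes the shipping location number, line items are grouped after all locations rather than under their own location.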
+
+
+
+
+ Denormalized
+ A variant of the Single Table With Record Types approach is to denormalize and flatten some of the object
+ hierarchy, such as collapsing the ShippingLocation attributes onto each LineItem instance.
+
+ The LineItem composite rowkey would be something like this:
+
+ [order-rowkey]
+ [LINE record type]
+ [line item number] (e.g., 1st lineitem, 2nd, etc. - care must be taken that these are unique across the entire order)
+
+
+ ... and the LineItem columns would be something like this:
+
+ itemNumber
+ quantity
+ price
+ shipToLine1 (denormalized from ShippingLocation)
+ shipToLine2 (denormalized from ShippingLocation)
+ shipToCity (denormalized from ShippingLocation)
+ shipToState (denormalized from ShippingLocation)
+ shipToZip (denormalized from ShippingLocation)
+
+
+ The pros of this approach include a less complex object hierarchy, but one of the cons is that updating gets more
+ complicated in case any of this information changes.
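As a rough illustration of the flattening step, the sketch below copies the ShippingLocation attributes onto each LineItem and numbers line items across the whole order. The input dictionary layout (`shippingLocations`, `lineItems`, `line1`, and so on) is an assumption made for this example; only the output column names come from the list above.

```python
# Maps hypothetical ShippingLocation field names to the denormalized column names.
SHIP_TO_COLUMNS = {
    "line1": "shipToLine1",
    "line2": "shipToLine2",
    "city": "shipToCity",
    "state": "shipToState",
    "zip": "shipToZip",
}


def flatten_order(order: dict) -> list:
    """Return one flat row per LineItem, with ShippingLocation columns copied in."""
    rows = []
    line_no = 0  # numbered across the entire order so each number is unique
    for location in order["shippingLocations"]:
        for item in location["lineItems"]:
            line_no += 1
            row = {
                "lineItemNumber": line_no,
                "itemNumber": item["itemNumber"],
                "quantity": item["quantity"],
                "price": item["price"],
            }
            for field, column in SHIP_TO_COLUMNS.items():
                row[column] = location[field]
            rows.append(row)
    return rows
```

The update cost mentioned above is visible here: a change to one shipping address has to be rewritten on every row produced from that location.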
+
+
+
+ Object BLOB
+ With this approach, the entire Order object graph is treated, in one way or another, as a BLOB. For example, the
+ ORDER table's rowkey was described above, and a
+ single column called "order" would contain a serialized object holding the Order together with its
+ ShippingLocations and LineItems.
+
+ There are many options here: JSON, XML, Java Serialization, Avro, Hadoop Writables, etc. All of them are variants
+ of the same approach: encode the object graph to a byte-array. Care should be taken with this approach to ensure backward
+ compatibility in case the object model changes, so that older persisted structures can still be read back out of HBase.
+
+ Pros are being able to manage complex object graphs with minimal I/O (e.g., a single HBase Get per
+ Order in this example), but the cons include the aforementioned warning about backward compatibility of serialization,
+ language dependencies of serialization (e.g., Java Serialization only works with Java clients), the fact that
+ you have to deserialize the entire object to get any piece of information inside the BLOB, and the difficulty in
+ getting frameworks like Hive to work with custom objects like this.
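One common way to handle the backward-compatibility concern is to store a version marker alongside the serialized object. The sketch below uses JSON purely for illustration; the envelope layout, the version numbers, and the version 1 to version 2 migration (a single `shippingLocation` becoming a `shippingLocations` list) are all hypothetical and not taken from the text above.

```python
import json

CURRENT_VERSION = 2  # hypothetical schema version of the Order object graph


def encode_order(order: dict) -> bytes:
    """Serialize the whole Order graph to one byte array, tagged with a version."""
    envelope = {"version": CURRENT_VERSION, "order": order}
    return json.dumps(envelope, sort_keys=True).encode("utf-8")


def decode_order(blob: bytes) -> dict:
    """Deserialize a BLOB, upgrading older persisted structures on read."""
    envelope = json.loads(blob.decode("utf-8"))
    order = envelope["order"]
    # BLOBs written before versioning existed are treated as version 1.
    if envelope.get("version", 1) < 2:
        # Hypothetical migration: version 1 held a single shipping location.
        order["shippingLocations"] = [order.pop("shippingLocation")]
    return order
```

Note that the entire BLOB must still be decoded to read any single field, which is the deserialization cost listed among the cons.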
+
+
+
+
+
Case Study - "Tall/Wide/Middle" Schema Design Smackdown
This section will describe additional schema design questions that appear on the dist-list, specifically about
tall and wide tables. These are general guidelines and not laws - each application must consider its own needs.