Index: hbase-assembly/src/docbkx/schema_design.xml =================================================================== --- hbase-assembly/src/docbkx/schema_design.xml (revision 1464053) +++ hbase-assembly/src/docbkx/schema_design.xml (working copy) @@ -438,7 +438,7 @@ Log Data / Timeseries Data Log Data / Timeseries on Steroids - Customer/Sales + Customer/Order Tall/Wide/Middle Schema Design List Data @@ -527,7 +527,7 @@ -
+
Case Study - Log Data and Timeseries Data on Steroids This effectively is the OpenTSDB approach. What OpenTSDB does is re-write data and pack rows into columns for certain time-periods. For a detailed explanation, see: http://opentsdb.net/schema.html, @@ -549,10 +549,10 @@
-
- Case Study - Customer / Sales - Assume that HBase is used to store customer and sales information. There are two core record-types being ingested: - a Customer record type, and Sales record type. +
+ Case Study - Customer/Order + Assume that HBase is used to store customer and order information. There are two core record-types being ingested: + a Customer record type and an Order record type. The Customer record type would include all the things that you’d typically expect: @@ -562,21 +562,21 @@ Phone numbers, etc. - The Sales record type would include things like: + The Order record type would include things like: Customer number - Sales/order number + Order number Sales date - A series of nested objects for shipping locations and line-items (this itself is a design case study) + A series of nested objects for shipping locations and line-items (see + for details) Assuming that the combination of customer number and sales order uniquely identify an order, these two attributes will compose the rowkey, and specifically a composite key such as: - [customer number][sales number] + [customer number][order number] - -… for a SALES table. However, there are more design decisions to make: are the raw values the best choices for rowkeys? + … for an ORDER table. However, there are more design decisions to make: are the raw values the best choices for rowkeys? The same design questions in the Log Data use-case confront us here. What is the keyspace of the customer number, and what is the format (e.g., numeric? alphanumeric?) As it is advantageous to use fixed-length keys in HBase, as well as keys that can support a @@ -585,16 +585,16 @@ Composite Rowkey With Hashes: [MD5 of customer number] = 16 bytes - [MD5 of sales number] = 16 bytes + [MD5 of order number] = 16 bytes Composite Numeric/Hash Combo Rowkey: [substituted long for customer number] = 8 bytes - [MD5 of sales number] = 16 bytes + [MD5 of order number] = 16 bytes -
+
Single Table? Multiple Tables? A traditional design approach would have separate tables for CUSTOMER and SALES. Another option is to pack multiple record types into a single table (e.g., CUSTOMER++). @@ -605,11 +605,11 @@ [type] = type indicating ‘1’ for customer record type - Sales Record Type Rowkey: + Order Record Type Rowkey: [customer-id] - [type] = type indicating ‘2’ for sales record type - [sales-order] + [type] = type indicating ‘2’ for order record type + [order] The advantage of this particular CUSTOMER++ approach is that organizes many different record-types by customer-id @@ -617,7 +617,121 @@ a particular record-type.
-
+
+ Order Object Design + Now we need to address how to model the Order object. Assume that the class structure is as follows: + +Order + ShippingLocation (an Order can have multiple ShippingLocations) + LineItem (a ShippingLocation can have multiple LineItems) + + ... there are multiple options for storing this data. + +
+ Completely Normalized + With this approach, there would be separate tables for ORDER, SHIPPING_LOCATION, and LINE_ITEM. + + The ORDER table's rowkey was described above: + + The SHIPPING_LOCATION table's composite rowkey would be something like this: + + [order-rowkey] + [shipping location number] (e.g., 1st location, 2nd, etc.) + + + The LINE_ITEM table's composite rowkey would be something like this: + + [order-rowkey] + [shipping location number] (e.g., 1st location, 2nd, etc.) + [line item number] (e.g., 1st lineitem, 2nd, etc.) + + + Such a normalized model is likely to be the approach with an RDBMS, but that's not your only option with HBase. + The con of such an approach is that, to retrieve information about any Order, you will need: + + Get on the ORDER table for the Order + Scan on the SHIPPING_LOCATION table for that order to get the ShippingLocation instances + Scan on the LINE_ITEM table for each ShippingLocation + + ... granted, this is what an RDBMS would do under the covers anyway, but since there are no joins in HBase + you're just more aware of this fact. + +
+
+ Single Table With Record Types + With this approach, there would exist a single table ORDER that would contain the Order, ShippingLocation, and LineItem record types. + + The Order rowkey was described above: + + [order-rowkey] + [ORDER record type] + + + The ShippingLocation composite rowkey would be something like this: + + [order-rowkey] + [SHIPPING record type] + [shipping location number] (e.g., 1st location, 2nd, etc.) + + + The LineItem composite rowkey would be something like this: + + [order-rowkey] + [LINE record type] + [shipping location number] (e.g., 1st location, 2nd, etc.) + [line item number] (e.g., 1st lineitem, 2nd, etc.) + + +
+
+ Denormalized + A variant of the Single Table With Record Types approach is to denormalize and flatten some of the object + hierarchy, such as collapsing the ShippingLocation attributes onto each LineItem instance. + + The LineItem composite rowkey would be something like this: + + [order-rowkey] + [LINE record type] + [line item number] (e.g., 1st lineitem, 2nd, etc. - care must be taken that these are unique across the entire order) + + + ... and the LineItem columns would be something like this: + + itemNumber + quantity + price + shipToLine1 (denormalized from ShippingLocation) + shipToLine2 (denormalized from ShippingLocation) + shipToCity (denormalized from ShippingLocation) + shipToState (denormalized from ShippingLocation) + shipToZip (denormalized from ShippingLocation) + + + The pros of this approach include a less complex object hierarchy, but one of the cons is that updating gets more + complicated in case any of this information changes. + +
+
+ Object BLOB + With this approach, the entire Order object graph is treated, in one way or another, as a BLOB. For example, the + ORDER table's rowkey was described above: , and a + single column called "order" would contain a serialized object that could be deserialized into a container Order with its + ShippingLocations and LineItems. + + There are many options here: JSON, XML, Java Serialization, Avro, Hadoop Writables, etc. All of them are variants + of the same approach: encode the object graph to a byte-array. Care should be taken with this approach to ensure backward + compatibility in case the object model changes, so that older persisted structures can still be read back out of HBase. + + Pros are being able to manage complex object graphs with minimal I/O (e.g., a single HBase Get per + Order in this example), but the cons include the aforementioned warning about backward compatibility of serialization, + language dependencies of serialization (e.g., Java Serialization only works with Java clients), the fact that + you have to deserialize the entire object to get any piece of information inside the BLOB, and the difficulty in + getting frameworks like Hive to work with custom objects like this. +
+
+
+
Case Study - "Tall/Wide/Middle" Schema Design Smackdown This section will describe additional schema design questions that appear on the dist-list, specifically about tall and wide tables. These are general guidelines and not laws - each application must consider its own needs.