Index: src/docbkx/book.xml =================================================================== --- src/docbkx/book.xml (revision 1080340) +++ src/docbkx/book.xml (working copy) @@ -1376,6 +1376,9 @@ Monotonically Increasing Row Keys/Timeseries Data + + In the HBase chapter of Tom White’s book “Hadoop: The Definitive Guide” there is an optimization note warning of a phenomenon in which an import process walks in lock-step, with all clients in concert pounding one of the table’s regions (and thus a single node), then moving on to the next region, and so on. With monotonically increasing row-keys (e.g., a timestamp), this will happen. It can be mitigated in part by randomizing the input records so that they are not in sorted order, but in general it’s best to avoid using a timestamp as the row-key. + See this comic by IKai Lan on why monotically increasing row keys are problematic in BigTable-like datastores: monotonically increasing values are bad. @@ -1381,8 +1384,9 @@ monotonically increasing values are bad. If you need to upload time series data into HBase, you should study OpenTSDB as a - successful example. It has a page describing the schema it uses in - HBase. You might also consider just using OpenTSDB altogether. + successful example. It has a page describing the schema it uses in + HBase. The key format in OpenTSDB is effectively [metric_type][event_timestamp], which at first glance appears to contradict the previous advice about not using a timestamp as the key. The difference is that the timestamp is not in the lead position of the key, and the design assumes that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data containing a mix of metric types, the Puts are distributed across many regions of the table. +
Try to minimize row and column sizes @@ -1403,6 +1407,46 @@ names. `
+
+ + Table Creation – Pre-Creating Regions + + + Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients write to the same region until it grows large enough to split and be distributed across the cluster. A useful pattern for speeding up a bulk import is to pre-create empty regions. Be somewhat conservative here, because too many regions can actually degrade performance. An example of pre-creation using hex keys follows (note: this example may need to be adapted to the individual application’s keys): + + 
+	public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, 
+    		byte[][] splits) throws IOException {
+        try {
+            admin.createTable( table, splits );
+            return true;
+        } catch (TableExistsException e) {
+        	logger.info("table " + table.getNameAsString() + " already exists");
+        	// the table already exists...
+        	return false;  
+        }
+    }
+    public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
+    	byte[][] splits = new byte[numRegions-1][];
+    	BigInteger lowestKey = new BigInteger(startKey, 16);
+    	BigInteger highestKey = new BigInteger(endKey, 16);
+    	BigInteger range = highestKey.subtract(lowestKey);
+    	// Evenly divide the key range; split points fall at lowestKey + k * regionIncrement
+    	BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
+    	lowestKey = lowestKey.add(regionIncrement);
+    	for(int i=0; i < numRegions-1;i++) {
+    		BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
+    		byte[] b = String.format("%016x", key).getBytes();
+    		splits[i] = b;
+    	}
+    	
+    	return splits;
+    }
+  
+
+
+
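
To check which split points the getHexSplits arithmetic in this patch produces before creating a table, the same calculation can be run without any HBase dependency. The following is a minimal sketch, not part of the patch: the class name HexSplitSketch and the method hexSplitPoints are hypothetical names chosen for illustration, and the argument check for numRegions is an added assumption that the patch itself does not make.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone class; mirrors the arithmetic of getHexSplits in the patch.
public class HexSplitSketch {

    // Returns the numRegions - 1 split points between startKey and endKey,
    // each formatted as a 16-character, zero-padded, lowercase hex string.
    static List<String> hexSplitPoints(String startKey, String endKey, int numRegions) {
        if (numRegions < 2) {
            // Assumption for this sketch: fewer than two regions needs no split points.
            throw new IllegalArgumentException("numRegions must be at least 2");
        }
        BigInteger lowestKey = new BigInteger(startKey, 16);
        BigInteger highestKey = new BigInteger(endKey, 16);
        // Integer division, so the last region absorbs any remainder.
        BigInteger regionIncrement =
            highestKey.subtract(lowestKey).divide(BigInteger.valueOf(numRegions));

        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
            splits.add(String.format("%016x", key));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Print the three split points for four regions over a small key range.
        for (String split : hexSplitPoints("0", "10", 4)) {
            System.out.println(split);
        }
    }
}
```

In the patch, the byte form of each of these strings is what gets passed to createTable as a split key, so the first region covers keys below the first split point and the last region covers keys from the last split point upward.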