Index: src/docbkx/performance.xml =================================================================== --- src/docbkx/performance.xml (revision 1153887) +++ src/docbkx/performance.xml (working copy) @@ -134,25 +134,21 @@ See . -
- Data Clumping +
+ Writing to HBase - If all your data is being written to one region, then re-read the - section on processing timeseries - data. -
- -
- Batch Loading - Use the bulk load tool if you can. See +
+ Batch Loading + Use the bulk load tool if you can. See Bulk Loads. Otherwise, pay attention to the below. - + +
-
- - Table Creation: Pre-Creating Regions - +
+ + Table Creation: Pre-Creating Regions + Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster. A useful pattern to speed up the bulk import process is to pre-create empty regions. Be somewhat conservative in this, because too-many regions can actually degrade performance. An example of pre-creation using hex-keys is as follows (note: this example may need to be tweaked to the individual applications keys): @@ -185,10 +181,10 @@ }
-
- - Table Creation: Deferred Log Flush - +
+ + Table Creation: Deferred Log Flush + The default behavior for Puts using the Write Ahead Log (WAL) is that HLog edits will be written immediately. If deferred log flush is used, WAL edits are kept in memory until the flush period. The benefit is aggregated and asynchronous HLog- writes, but the potential downside is that if @@ -198,14 +194,10 @@ Deferred log flush can be configured on tables via HTableDescriptor. The default value of hbase.regionserver.optionallogflushinterval is 1000ms. -
-
- -
- HBase Client +
- AutoFlush + HBase Client: AutoFlush When performing a lot of Puts, make sure that setAutoFlush is set to false on your close on the HTable instance will invoke flushCommits.
+
+ HBase Client: Turn off WAL on Puts + A frequently discussed option for increasing throughput on Puts is to call setWriteToWAL(false) on the Put. Turning this off means + that the RegionServer will not write the Put to the Write Ahead Log, + only into the memstore. However, the consequence is that if there + is a RegionServer failure there will be data loss. + If setWriteToWAL(false) is used, do so with extreme caution. You may find in actuality that + it makes little difference if your load is well distributed across the cluster. + + In general, it is best to use WAL for Puts, and where loading throughput + is a concern to use bulk loading techniques instead. + +
+
+ HBase Client: Group Puts by RegionServer + In addition to using the writeBuffer, grouping Puts by RegionServer can reduce the number of client RPC calls per writeBuffer flush. + There is a utility HTableUtil currently on TRUNK that does this; those still on 0.90.x or earlier can either copy it or implement + their own version. + +
+
+ MapReduce: Skip The Reducer + When writing a lot of data to an HBase table in a Mapper (e.g., with TableOutputFormat), + skip the Reducer step whenever possible. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then shuffled to other + Reducers that will most likely be off-node. + +
+ +
+ Anti-Pattern: One Hot Region + If all your data is being written to one region at a time, then re-read the + section on processing timeseries data. + Also, see , as well as +
+ +
+ +
+ Reading from HBase
Scan Caching @@ -286,18 +318,12 @@ and minimal network traffic to the client for a single row.
-
- Turn off WAL on Puts - A frequently discussed option for increasing throughput on Puts is to call writeToWAL(false). Turning this off means - that the RegionServer will not write the Put to the Write Ahead Log, - only into the memstore, HOWEVER the consequence is that if there - is a RegionServer failure there will be data loss. - If writeToWAL(false) is used, do so with extreme caution. You may find in actuality that - it makes little difference if your load is well distributed across the cluster. - - In general, it is best to use WAL for Puts, and where loading throughput - is a concern to use bulk loading techniques instead. - -
-
+
+ Concurrency: Monitor Data Spread + When performing a high number of concurrent reads, monitor the data spread of the target tables. If the target table(s) are in + too few regions, then the reads will be served by only a few nodes. + See , as well as +
+ +