Index: src/docbkx/performance.xml =================================================================== --- src/docbkx/performance.xml (revision 1156398) +++ src/docbkx/performance.xml (working copy) @@ -336,4 +336,25 @@ + +
+ Deleting from HBase +
+ Using HBase Tables as Queues + HBase tables are sometimes used as queues. In this case, special care must be taken to regularly perform major compactions on tables used in + this manner. As is documented in , marking rows as deleted creates additional StoreFiles which then need to be processed + on reads. Tombstones only get cleaned up with major compactions. + + See also and HBaseAdmin.majorCompact. + +
+
+ Delete RPC Behavior + Be aware that htable.delete(Delete) doesn't use the writeBuffer. It will execute a RegionServer RPC with each invocation. + For a large number of deletes, consider htable.delete(List). + + See + +
+
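The batching advice in the Delete RPC Behavior section can be illustrated with a minimal sketch. It uses only the Java standard library so it runs without an HBase client on the classpath. The class name DeleteBatcher, the partition helper, and the batch size of 1000 are assumptions made for this example and are not part of HBase. In real code each batch would be turned into a List of Delete objects and passed to htable.delete(List).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of HBase): groups row keys into bounded batches
// so that each batch can be submitted with one htable.delete(List) call.
public class DeleteBatcher {

    // Splits items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> rowKeys = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            rowKeys.add("row-" + i);
        }
        // Batch size 1000 is an arbitrary choice for illustration.
        List<List<String>> batches = partition(rowKeys, 1000);
        // With the HBase client, each batch would become a List<Delete> passed to
        // htable.delete(List) rather than one htable.delete(Delete) call per row.
        System.out.println(batches.size() + " batched client calls instead of "
                + rowKeys.size() + " single calls");
    }
}
```

Bounding the batch size keeps each list of pending deletes small in client memory. Whether that matters depends on the workload, so treat the number as a tunable rather than a recommendation.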
Index: src/docbkx/book.xml =================================================================== --- src/docbkx/book.xml (revision 1158223) +++ src/docbkx/book.xml (working copy) @@ -108,10 +108,10 @@ job // job instance ); ...and the mapper instance would extend TableMapper... - public class MyMapper extends TableMapper<Text, LongWritable> { -public void map(ImmutableBytesWritable row, Result value, Context context) -throws InterruptedException, IOException { -// process data for the row from the Result instance. + +public class MyMapper extends TableMapper<Text, LongWritable> { + public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException { + // process data for the row from the Result instance.
@@ -211,7 +211,7 @@
Try to minimize row and column sizes - Or why are my storefile indices large? + Or why are my StoreFile indices large? In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it'll be accompanied by its row, column name, and timestamp - always. If your rows and column names @@ -230,9 +230,25 @@ Compression will also make for larger indices. See the thread a question storefileIndexSize up on the user mailing list. - ` - In summary, although verbose attribute names (e.g., "myImportantAttribute") are easier to read, you pay for the clarity in storage and increased I/O - use shorter attribute names and constants. - Also, try to keep the row-keys as small as possible too. + + Most of the time, small inefficiencies don't matter all that much. Unfortunately, + this is a case where they do. Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys, they could be repeated + several billion times in your data. +
Column Families + Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default). + +
+
Attributes + Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to read, prefer shorter attribute names (e.g., "via") + to store in HBase. + +
+
Row Key + Keep rowkeys as short as is reasonable while still being useful for the required data access (e.g., Get vs. Scan). + A short key that is useless for data access is not better than a longer key with better get/scan properties. Expect tradeoffs + when designing rowkeys. + +
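To make the "repeated several billion times" point concrete, here is a rough arithmetic sketch. It is a simplified model, not HBase's actual on-disk accounting: it counts only the UTF-8 bytes of the rowkey, ColumnFamily name, and attribute name plus an 8-byte timestamp, and ignores fixed-length fields, compression, and block encoding. The class name KeyOverhead, the sample rowkey, and the cell count are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical estimator (not part of HBase): approximates the coordinate bytes
// that accompany every cell value under a deliberately simplified model.
public class KeyOverhead {

    // A timestamp is stored as a long, i.e. 8 bytes.
    static final int TIMESTAMP_BYTES = 8;

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes of rowkey + ColumnFamily name + attribute name + timestamp for one cell.
    static long coordinateBytes(String rowKey, String family, String qualifier) {
        return utf8Length(rowKey) + utf8Length(family) + utf8Length(qualifier) + TIMESTAMP_BYTES;
    }

    public static void main(String[] args) {
        long cells = 1_000_000_000L; // assumed cell count, for illustration only
        long verbose = coordinateBytes("user000000000001", "data", "myVeryImportantAttribute");
        long compact = coordinateBytes("user000000000001", "d", "via");
        System.out.println("verbose names: " + verbose + " bytes per cell");
        System.out.println("compact names: " + compact + " bytes per cell");
        System.out.println("difference over " + cells + " cells: "
                + (verbose - compact) * cells + " bytes");
    }
}
```

Under this model the saving comes entirely from the ColumnFamily and attribute names, since the rowkey is the same in both cases. Shortening the rowkey would add to the saving, subject to the access-pattern tradeoffs described above.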
@@ -289,6 +305,14 @@ <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link> for more information. </para> </section> + <section xml:id="ttl"> + <title>Time To Live (TTL) + ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached. + This applies to all versions of a row - even the current one. The TTL time encoded in HBase for the row is specified in UTC. + + See HColumnDescriptor for more information. + +
Secondary Indexes and Alternate Query Paths