diff --git hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.java hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.java
index 03bc4f0..5153474 100644
--- hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.java
+++ hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.java
@@ -391,8 +391,7 @@ public class TableMapReduceUtil {
       job.setMapOutputKeyClass(outputKeyClass);
     }
     job.setMapperClass(mapper);
-    Configuration conf = job.getConfiguration();
-    HBaseConfiguration.merge(conf, HBaseConfiguration.create(conf));
+    HBaseConfiguration.addHbaseResources(job.getConfiguration());
     List<String> scanStrings = new ArrayList<String>();

     for (Scan scan : scans) {
diff --git src/main/docbkx/performance.xml src/main/docbkx/performance.xml
index dad9b0c..04ca00c 100644
--- src/main/docbkx/performance.xml
+++ src/main/docbkx/performance.xml
@@ -295,18 +295,131 @@
         Bloom Filters
-        Bloom Filters can be enabled per-ColumnFamily. Use
-        HColumnDescriptor.setBloomFilterType(NONE | ROW | ROWCOL) to enable blooms
-        per Column Family. Default = NONE for no bloom filters. If
-        ROW, the hash of the row will be added to the bloom on each insert. If
-        ROWCOL, the hash of the row + column family name + column family
-        qualifier will be added to the bloom on each key insert.
-        See HColumnDescriptor
-        and for more information or this answer up in quora,
+        A Bloom filter, named for its creator, Burton Howard Bloom, is a data structure which
+        is designed to predict whether a given element is a member of a set of data. A positive
+        result from a Bloom filter is not always accurate, but a negative result is guaranteed
+        to be accurate. Bloom filters are designed to be "accurate enough" for sets of data
+        which are so large that conventional hashing mechanisms would be impractical. For more
+        information about Bloom filters in general, refer to .
+        In terms of HBase, Bloom filters provide a lightweight in-memory structure to reduce
+        the number of disk reads for a given Get operation (Bloom filters do not work with
+        Scans) to only the StoreFiles likely to contain the desired Row. The potential
+        performance gain increases with the number of parallel reads.
+        The Bloom filters themselves are stored in the metadata of each HFile and never need
+        to be updated. When an HFile is opened because a region is deployed to a RegionServer,
+        the Bloom filter is loaded into memory.
+        HBase includes some tuning mechanisms for folding the Bloom filter to reduce the size
+        and keep the false positive rate within a desired range.
+        Bloom filters were introduced in HBASE-1200. Since
+        HBase 0.96, row-based Bloom filters are enabled by default. (HBASE-)
+        For more information on Bloom filters in relation to HBase, see the following Quora
+        discussion: How are bloom filters used in HBase?.
+
+
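The membership guarantees described above (a positive answer may be wrong, a negative answer never is) can be illustrated with a small self-contained sketch. This is an illustrative toy, not HBase's Bloom filter implementation; the class name and hash scheme are invented for the example.

```java
import java.util.BitSet;

/**
 * A toy Bloom filter: k hash functions set k bits per added key.
 * mightContain() can return a false positive, but never a false negative.
 */
public class BloomSketch {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public BloomSketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit position from the key's hash code.
    private int position(String key, int i) {
        int h = key.hashCode() * 31 + i * 0x9E3779B9;
        return Math.abs(h % size);
    }

    public void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(position(key, i));
        }
    }

    /** Returns false only when the key was certainly never added. */
    public boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomSketch bloom = new BloomSketch(1024, 3);
        bloom.add("row-0001");
        bloom.add("row-0002");
        // An added key is always reported as possibly present.
        System.out.println(bloom.mightContain("row-0001")); // true
        // A key that was never added is usually, but not always, reported absent.
        System.out.println(bloom.mightContain("no-such-row"));
    }
}
```

HBase applies the same idea per StoreFile: a negative answer lets a Get skip that file's disk read entirely.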
+        When To Use Bloom Filters
+        Since HBase 0.96, row-based Bloom filters are enabled by default. You may choose to
+        disable them or to change some tables to use row+column Bloom filters, depending on
+        the characteristics of your data and how it is loaded into HBase.
+
+        To determine whether Bloom filters could have a positive impact, check the value of
+        blockCacheHitRatio in the RegionServer metrics. If Bloom filters are enabled, the
+        value of blockCacheHitRatio should increase, because the Bloom filter is filtering
+        out blocks that are definitely not needed.
+        You can choose to enable Bloom filters for a row or for a row+column combination. If
+        you generally scan entire rows, the row+column combination will not provide any
+        benefit. A row-based Bloom filter can operate on a row+column Get, but not the other
+        way around. However, if you have a large number of column-level Puts, such that a row
+        may be present in every StoreFile, a row-based filter will always return a positive
+        result and provide no benefit. Unless you have one column per row, row+column Bloom
+        filters require more space, in order to store more keys. Bloom filters work best when
+        the size of each data entry is at least a few kilobytes.
+        Overhead is reduced when your data is stored in a few larger StoreFiles, avoiding
+        extra disk IO during low-level scans to find a specific row.
+        Bloom filters need to be rebuilt upon deletion, so they may not be appropriate in
+        environments with a large number of deletions.
+
+
+        Enabling Bloom Filters
+        Bloom filters are enabled on a Column Family. You can do this in the HBase Shell, or
+        by using the setBloomFilterType method of HColumnDescriptor in the HBase API. Valid
+        values are NONE, ROW (the default), or ROWCOL. See the previous section for more
+        information on ROW versus ROWCOL. See also the API documentation for
+        HColumnDescriptor.
+        The following example creates a table and enables a ROWCOL Bloom filter on the
+        colfam1 column family.
+
+hbase> create 'mytable',{NAME => 'colfam1', BLOOMFILTER => 'ROWCOL'}
+
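The API route mentioned above could be sketched as follows. This is an illustrative sketch, not part of the patch: it assumes the HBase 0.96 client classes (HBaseAdmin, HTableDescriptor, HColumnDescriptor, BloomType) on the classpath and a reachable cluster, and it mirrors the shell example by creating the table rather than altering an existing one.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class CreateTableWithBloom {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml etc. from the classpath; assumes a running cluster.
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      HTableDescriptor table = new HTableDescriptor(TableName.valueOf("mytable"));
      HColumnDescriptor colfam1 = new HColumnDescriptor("colfam1");
      colfam1.setBloomFilterType(BloomType.ROWCOL); // NONE, ROW, or ROWCOL
      table.addFamily(colfam1);
      admin.createTable(table);
    } finally {
      admin.close();
    }
  }
}
```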
+
+
+        Configuring Server-Wide Behavior of Bloom Filters
+        You can configure the following settings in the hbase-site.xml.
+
+        Parameter                                  Default      Description
+
+        io.hfile.bloom.enabled                     yes          Set to no to kill bloom filters server-wide
+                                                                if something goes wrong.
+
+        io.hfile.bloom.error.rate                  .01          The average false positive rate for bloom
+                                                                filters. Folding is used to maintain the
+                                                                false positive rate. Expressed as a decimal
+                                                                representation of a percentage.
+
+        io.hfile.bloom.max.fold                    7            The guaranteed maximum fold rate. Changing
+                                                                this setting should not be necessary and is
+                                                                not recommended.
+
+        io.storefile.bloom.max.keys                128000000    For default (single-block) Bloom filters,
+                                                                this specifies the maximum number of keys.
+
+        io.storefile.delete.family.bloom.enabled   true         Master switch to enable Delete Family Bloom
+                                                                filters and store them in the StoreFile.
+
+        io.storefile.bloom.block.size              65536        Target Bloom block size. Bloom filter blocks
+                                                                of approximately this size are interleaved
+                                                                with data blocks.
+
+        hfile.block.bloom.cacheonwrite             false        Enables cache-on-write for inline blocks of
+                                                                a compound Bloom filter.
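As a usage sketch for the table above, a deployment could tighten the false positive rate by overriding io.hfile.bloom.error.rate in hbase-site.xml. The value .005 here is purely illustrative, not a recommendation; a lower error rate costs more Bloom filter space per StoreFile.

```xml
<property>
  <name>io.hfile.bloom.error.rate</name>
  <value>.005</value>
</property>
```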