Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-3063

BloomMapFile - fail-fast version of MapFile for sparsely populated key space


    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.0
    • Fix Version/s: 0.20.0
    • Component/s: io
    • Labels:
    • Hadoop Flags:
    • Release Note:
      Introduced BloomMapFile subclass of MapFile that creates a Bloom filter from all keys.


      The need for this improvement arose when working with large ancillary MapFile-s (essentially used as external dictionaries). For each invokation of map() / reduce() it was necessary to perform several look-ups in these MapFile-s, and in case of sparsely populated key-space the cost of finding that a key is absent was too high.

      This patch implements a subclass of MapFile that creates a Bloom filter from all keys, so that accurate tests for absence of keys can be performed quickly and with 100% accuracy.

      Writer.append() operations update a DynamicBloomFilter, which is then serialized when the Writer is closed. This filter is loaded in memory when a Reader is created. Reader.get() operation first checks the filter for the key membership, and if the key is absent it immediately returns null without doing any further IO.


        1. bloommap-v4.patch
          101 kB
          Andrzej Bialecki
        2. bloommap-v3.patch
          96 kB
          Andrzej Bialecki
        3. bloommap-v2.patch
          77 kB
          Andrzej Bialecki
        4. bloommap.patch
          87 kB
          Andrzej Bialecki

          Issue Links



              • Assignee:
                ab Andrzej Bialecki
                ab Andrzej Bialecki
              • Votes:
                0 Vote for this issue
                7 Start watching this issue


                • Created: