Hadoop Common
  1. Hadoop Common
  2. HADOOP-3063

BloomMapFile - fail-fast version of MapFile for sparsely populated key space


    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.0
    • Fix Version/s: 0.20.0
    • Component/s: io
    • Labels:
    • Hadoop Flags:
    • Release Note:
      Introduced BloomMapFile subclass of MapFile that creates a Bloom filter from all keys.


      The need for this improvement arose when working with large ancillary MapFile-s (essentially used as external dictionaries). For each invokation of map() / reduce() it was necessary to perform several look-ups in these MapFile-s, and in case of sparsely populated key-space the cost of finding that a key is absent was too high.

      This patch implements a subclass of MapFile that creates a Bloom filter from all keys, so that accurate tests for absence of keys can be performed quickly and with 100% accuracy.

      Writer.append() operations update a DynamicBloomFilter, which is then serialized when the Writer is closed. This filter is loaded in memory when a Reader is created. Reader.get() operation first checks the filter for the key membership, and if the key is absent it immediately returns null without doing any further IO.

      1. bloommap-v4.patch
        101 kB
        Andrzej Bialecki
      2. bloommap-v3.patch
        96 kB
        Andrzej Bialecki
      3. bloommap-v2.patch
        77 kB
        Andrzej Bialecki
      4. bloommap.patch
        87 kB
        Andrzej Bialecki

        Issue Links


          No work has yet been logged on this issue.


            • Assignee:
              Andrzej Bialecki
              Andrzej Bialecki
            • Votes:
              0 Vote for this issue
              7 Start watching this issue


              • Created: