Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-3063

BloomMapFile - fail-fast version of MapFile for sparsely populated key space

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 0.20.0
    • 0.20.0
    • io
    • None
    • Reviewed
    • Introduced BloomMapFile subclass of MapFile that creates a Bloom filter from all keys.

    Description

      The need for this improvement arose when working with large ancillary MapFile-s (essentially used as external dictionaries). For each invokation of map() / reduce() it was necessary to perform several look-ups in these MapFile-s, and in case of sparsely populated key-space the cost of finding that a key is absent was too high.

      This patch implements a subclass of MapFile that creates a Bloom filter from all keys, so that accurate tests for absence of keys can be performed quickly and with 100% accuracy.

      Writer.append() operations update a DynamicBloomFilter, which is then serialized when the Writer is closed. This filter is loaded in memory when a Reader is created. Reader.get() operation first checks the filter for the key membership, and if the key is absent it immediately returns null without doing any further IO.

      Attachments

        1. bloommap.patch
          87 kB
          Andrzej Bialecki
        2. bloommap-v2.patch
          77 kB
          Andrzej Bialecki
        3. bloommap-v3.patch
          96 kB
          Andrzej Bialecki
        4. bloommap-v4.patch
          101 kB
          Andrzej Bialecki

        Issue Links

          Activity

            People

              ab Andrzej Bialecki
              ab Andrzej Bialecki
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: