Hadoop Common
HADOOP-11829

Improve the vector size of Bloom Filter from int to long, and storage from memory to disk


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: util
    • Labels: None

    Description

      org.apache.hadoop.util.bloom.BloomFilter(int vectorSize, int nbHash, int hashType)
      With a false-positive probability of 0.0001, this filter can hold almost 900 million objects and needs about 2.1 GB of RAM.
      In my project I need to build a filter with a capacity of 2 billion objects. It needs about 4.7 GB of RAM, and the required vector size is 38,340,233,509 bits, which is outside the range of int. I also do not have that much RAM, so I rebuilt a big Bloom filter whose vector size is a long, split the bit data into several files on disk, and distributed the files to the worker nodes. The performance is very good.
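      For context, the numbers above follow from the standard Bloom filter sizing formulas m = -n·ln(p)/(ln 2)² and k = (m/n)·ln 2. A minimal sketch (not Hadoop code; the class and method names are hypothetical) showing that the optimal bit-vector size for 2 billion items at p = 0.0001 overflows int:

```java
// Sketch: optimal Bloom filter sizing. For n items and target
// false-positive rate p, the optimal number of bits is
// m = -n * ln(p) / (ln 2)^2 and the optimal number of hash
// functions is k = (m / n) * ln 2.
public class BloomSizing {
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    static int optimalHashes(long n, long m) {
        return (int) Math.round((double) m / n * Math.log(2));
    }

    public static void main(String[] args) {
        long n = 2_000_000_000L;  // 2 billion items
        double p = 0.0001;        // target false-positive rate
        long m = optimalBits(n, p);       // ~3.8e10 bits: does not fit in an int
        System.out.println("bits     = " + m);
        System.out.println("overflow = " + (m > Integer.MAX_VALUE));
        System.out.println("hashes   = " + optimalHashes(n, m));
        System.out.printf("ram      = %.1f GB%n", m / 8.0 / 1e9);  // bits -> bytes
    }
}
```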
      I think I can contribute this code to Hadoop Common, together with a 128-bit hash function (MurmurHash).
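      The long-indexed bit storage described above could be sketched as follows (a minimal illustration, not the attached patch): the bits live in several long[] chunks, so total capacity is not limited by Java's int-sized array indices, and the same chunking makes it natural to spill each chunk to its own file on disk.

```java
// Sketch of a bit vector addressable by a long index, backed by
// multiple long[] chunks. Each chunk could also be persisted as a
// separate file and distributed to worker nodes.
public class LongBitSet {
    private final long[][] chunks;
    private final long chunkBits; // bits per chunk; must be a multiple of 64

    public LongBitSet(long nbits, long chunkBits) {
        this.chunkBits = chunkBits;
        int nChunks = (int) ((nbits + chunkBits - 1) / chunkBits);
        chunks = new long[nChunks][];
        long remaining = nbits;
        for (int i = 0; i < nChunks; i++) {
            long bitsHere = Math.min(remaining, chunkBits);
            chunks[i] = new long[(int) ((bitsHere + 63) / 64)]; // 64 bits per word
            remaining -= bitsHere;
        }
    }

    public void set(long bit) {
        long r = bit % chunkBits; // offset within the chunk
        // Java masks a long shift count to its low 6 bits,
        // so 1L << r is equivalent to 1L << (r % 64).
        chunks[(int) (bit / chunkBits)][(int) (r >>> 6)] |= 1L << r;
    }

    public boolean get(long bit) {
        long r = bit % chunkBits;
        return (chunks[(int) (bit / chunkBits)][(int) (r >>> 6)] & (1L << r)) != 0;
    }
}
```

A Bloom filter over this structure would map each of its k hash values to `hash % nbits` and set/test the corresponding bit, exactly as with an int-sized vector.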

      Attachments

        Activity

          People

            Assignee: Hongbo Xu (berezovsky)
            Reporter: Hongbo Xu (berezovsky)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: Original Estimate - 168h
                Remaining: Remaining Estimate - 168h
                Logged: Time Spent - Not Specified