Uploaded image for project: 'Nutch'
  1. Nutch
  2. NUTCH-122

block numbers need a better random number generator

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Invalid
    • 0.8
    • None
    • fetcher, indexer
    • None

    Description

      In order to support billions of block numbers, a better PRNG than java.util.Random is needed. To reach billions with low probability of collision, 64 bit random numbers are needed (the Birthday Problem is the model for the number of bits needed; the result is that twice as many bits are needed as the number of bits to count the expected number of items.) The built-in java.util.Random keeps only 48 bits of state which is only sufficient for 2^24 items. Using repeated calls to or more than one instance of Random does not increase its total entropy.

      Analysis
      util.Random is a linear congruential generator (LCG) identical to drand48.
      util.Random keeps 48 bits of state and gangs together 2 consecutive values to return 64 bit values.
      LCGs suffer from periodicity in the low order bits which would make modular binning less than random.
      "low order bits" could mean least significant byte.
      LCGs have periods in the range 106 to 109 when using 32 bit words, a range of poor to fair.
      seed = ( 0x5DEECE66DL * seed + 0xBL ) & ((1L << 48) - 1);
      the origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of time.
      Results of the Birthday Spacings Test look good.
      References
      http://www.math.utah.edu/~beebe/java/random/README
      http://www.pierssen.com/arcview/upload/esoterica/randomizer.html

      Recommended alternative: MersenneTwister
      Matsumoto and Nishimura (1998).
      Longest period of any known generator 2^19937 or about 10^6001.
      A period that exceeds the number of unique values seems ideal; obviously a shorter period than the number of unique values (like util.Random) is a problem).
      Faster than java.util.Random (Random was recent tweaked, however).
      Excellent result for Diehard Birthday Spacings Test.
      Can be seeded with up to 624 32 bit integers.

      Doug Cutting wrote on nutch-dev:
      > It just occurred to me that perhaps we could simply use sequential block numbering.
      > All block ids are generated centrally on the namenode.

      Response from Paul Baclace:
      I'm not sure what the advantage of sequential block numbers would be
      since long period PRNG block numbering does not even need to store
      it's state, just pick a new starting place.

      Sequential block numbering does have the downside that picking a datanode based on (BlockNum % DataNodeCount) would devolve into round robin. Any attempt to pass the sequence through a hash ends up becoming a random number generator.

      Sequential numbering provides contiguous numbers, but after G.C. that would be lost, so no advantage there.

      When human beings eyeball block numbers, many with small differences are more likely to be misread than many that are totally different.

      If block numbering is sequential, then there is a temptation to use 32 bits instead of 64, but 32 bits leads to wrap-around and uh oh.

      FSNamesystem uses Random to help pick a target datanode, but it could just use the randomness of block numbers.

      Attachments

        1. MersenneTwister.java
          25 kB
          Paul Baclace
        2. MersenneTwister.java
          25 kB
          Paul Baclace

        Activity

          People

            Unassigned Unassigned
            pbaclace Paul Baclace
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: