Mahout
  1. Mahout
  2. MAHOUT-687

Random generator objects- slight refactor

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels:

      Description

      Problems:

      • The uncommons RepeatableRNG classes are the basis of RandomUtils.
        • These classes cheerfully ignore setSeed.
      • Some people in the project want to move off Uncommons anyway.

      This patch uses the org.apache.commons.math.random.RandomGenerator classes instead of org.apache.uncommons.maths.RepeatableRNG classes.
      .

      1. MAHOUT-687.patch
        6 kB
        Lance Norskog
      2. MAHOUT-687.patch
        11 kB
        Lance Norskog

        Activity

        Hide
        Sean Owen added a comment -

        Let's tease apart several things going on here.

        If you want setSeed() to work on MersenneTwisterRNG, that's easy with a different one-line change that makes a new generator. It's not necessarily necessary to remove the implementation.

        Removing Uncommons Maths is not necessarily a goal, but I'd support it. But more than just MersenneTwisterRNG is used in the project, so removing it won't let you remove Uncommons Math. So this patch fails to compile. (Side note, I would only post a patch if it still makes the project compile and pass tests.)

        But then this patch does a bit more. It's replacing seeding based on /dev/urandom or SecureRandom with a simple increasing counter. What is the reasoning behind that?

        Show
        Sean Owen added a comment - Let's tease apart several things going on here. If you want setSeed() to work on MersenneTwisterRNG, that's easy with a different one-line change that makes a new generator. It's not necessarily necessary to remove the implementation. Removing Uncommons Maths is not necessarily a goal, but I'd support it. But more than just MersenneTwisterRNG is used in the project, so removing it won't let you remove Uncommons Math. So this patch fails to compile. (Side note, I would only post a patch if it still makes the project compile and pass tests.) But then this patch does a bit more. It's replacing seeding based on /dev/urandom or SecureRandom with a simple increasing counter. What is the reasoning behind that?
        Hide
        Lance Norskog added a comment -

        If you want setSeed() to work on MersenneTwisterRNG, that's easy with a different one-line change that makes a new generator.

        A deterministic random matrix or vector needs to set the seed for each multiply. This fix would create too much garbage. (Each MersenneTwister has 2500 bytes!) Once you say you need Commons MersenneTwister instead, because it has a setSeed(long), the rest of the patch ticks over.

        Removing Uncommons Maths is not necessarily a goal, but I'd support it.

        Other chatter on the list talked about pushing uncommons out completely. One step at a time.

        It's replacing seeding based on /dev/urandom or SecureRandom with a simple increasing counter.

        Oops- thought I changed that back.
        .
        The patch is clearly not finished. If a test fails because it relies on a deterministic result, that's easy to fix. If a test fails otherwise, probably the test does not supply enough data points for the algorithm to function. From a quick look, LogLikelihoodTest may have this problem.

        Show
        Lance Norskog added a comment - If you want setSeed() to work on MersenneTwisterRNG, that's easy with a different one-line change that makes a new generator. A deterministic random matrix or vector needs to set the seed for each multiply. This fix would create too much garbage. (Each MersenneTwister has 2500 bytes!) Once you say you need Commons MersenneTwister instead, because it has a setSeed(long), the rest of the patch ticks over. Removing Uncommons Maths is not necessarily a goal, but I'd support it. Other chatter on the list talked about pushing uncommons out completely. One step at a time. It's replacing seeding based on /dev/urandom or SecureRandom with a simple increasing counter. Oops- thought I changed that back. . The patch is clearly not finished. If a test fails because it relies on a deterministic result, that's easy to fix. If a test fails otherwise, probably the test does not supply enough data points for the algorithm to function. From a quick look, LogLikelihoodTest may have this problem.
        Hide
        Ted Dunning added a comment -

        A deterministic random matrix or vector needs to set the seed for each multiply. This fix would create too much garbage. (Each MersenneTwister has 2500 bytes!) Once you say you need Commons MersenneTwister instead, because it has a setSeed(long), the rest of the patch ticks over.

        MersenneTwister is unacceptable for this usage anyway. It takes far too much startup time. The commons implementation just uses the long with a weak generator to build the long seed so there isn't a difference in garbage created. Besides, this is totally ephemeral garbage that won't even survive out of newspace.

        A good implementation option is Murmurhash applied to row and column and salt.

        Show
        Ted Dunning added a comment - A deterministic random matrix or vector needs to set the seed for each multiply. This fix would create too much garbage. (Each MersenneTwister has 2500 bytes!) Once you say you need Commons MersenneTwister instead, because it has a setSeed(long), the rest of the patch ticks over. MersenneTwister is unacceptable for this usage anyway. It takes far too much startup time. The commons implementation just uses the long with a weak generator to build the long seed so there isn't a difference in garbage created. Besides, this is totally ephemeral garbage that won't even survive out of newspace. A good implementation option is Murmurhash applied to row and column and salt.
        Hide
        Lance Norskog added a comment -

        Now we're getting somewhere!

        Why should the default creator of a random number generator include a system call to the random seed operating system device driver?

        Show
        Lance Norskog added a comment - Now we're getting somewhere! Why should the default creator of a random number generator include a system call to the random seed operating system device driver?
        Hide
        Ted Dunning added a comment -

        Why should the default creator of a random number generator include a system call to the random seed operating system device driver?

        Because it provides good seeds from real physical processes?

        Show
        Ted Dunning added a comment - Why should the default creator of a random number generator include a system call to the random seed operating system device driver? Because it provides good seeds from real physical processes?
        Hide
        Lance Norskog added a comment -

        By popular demand, a less ambitious version. Four topics:

        • SamplingLongPrimitiveIterator used one common Random which cause the unit test "deterministic seed" trick fail for some unknown reason. Changed class to use one Random per instance.
        • RandomUtils had a private WeakHashMap out of different RandomWrappers that it built. It never used the map.

        RandomWrapper does exactly what it used to do, with one change and one addition:

        • setSeed(long x) now throws UnsupportedOperationException, because all of the
          Uncommons RNG classes ignore setSeed(long x). Yes. All of them.
        • You can now pull the seed that RandomWrapper makes so that you can make another Uncommons RNG object, or use it for any purpose.
        Show
        Lance Norskog added a comment - By popular demand, a less ambitious version. Four topics: SamplingLongPrimitiveIterator used one common Random which cause the unit test "deterministic seed" trick fail for some unknown reason. Changed class to use one Random per instance. RandomUtils had a private WeakHashMap out of different RandomWrappers that it built. It never used the map. RandomWrapper does exactly what it used to do, with one change and one addition: setSeed(long x) now throws UnsupportedOperationException, because all of the Uncommons RNG classes ignore setSeed(long x). Yes. All of them. You can now pull the seed that RandomWrapper makes so that you can make another Uncommons RNG object, or use it for any purpose.
        Hide
        Lance Norskog added a comment -

        Why throw an exception instead of helping the user by making the problem go away? Because code should not "help". If there's a problem, just tell me. Only I know how I want to handle the error. Don't "help" me- it just wastes my time.

        Fail Loud
        Fail Early

        Show
        Lance Norskog added a comment - Why throw an exception instead of helping the user by making the problem go away? Because code should not "help". If there's a problem, just tell me. Only I know how I want to handle the error. Don't "help" me- it just wastes my time. Fail Loud Fail Early
        Hide
        Ted Dunning added a comment -

        SamplingLongPrimitiveIterator used one common Random which cause the unit test "deterministic seed" trick fail for some unknown reason. Changed class to use one Random per instance.

        Random number generators should almost never be shared between objects for many reasons. The two I think are most important that it makes the objects inherently thread unsafe and it makes the threads sharing the RNG's very slow due to memory synchronization. Some RNG's aren't even thread safe themselves so sharing is a complete disaster as opposed to just bad.

        Show
        Ted Dunning added a comment - SamplingLongPrimitiveIterator used one common Random which cause the unit test "deterministic seed" trick fail for some unknown reason. Changed class to use one Random per instance. Random number generators should almost never be shared between objects for many reasons. The two I think are most important that it makes the objects inherently thread unsafe and it makes the threads sharing the RNG's very slow due to memory synchronization. Some RNG's aren't even thread safe themselves so sharing is a complete disaster as opposed to just bad.
        Hide
        Sean Owen added a comment -

        RandomWrapper most certainly does use the map. Look at the use of INSTANCES just below it in useTestSeed() which shows its purpose. It needs to reset all of the wrappers that have been made. setSeed() may be ignored, and like I've said there's an easy fix to that – new RNG object. I don't see a need to make the setSeed() method fail.

        I don't have a strong opinion about sharing RNGs per se – obviously correctness is vital, followed by performance, as concerns. So for example I think it's reasonable to use an RNG per iterator, yes.

        But at the moment I don't know that we have a determinism problem as a result of this?

        This patch does something different than the first patch, so I think I'm losing the plot about what this is out to accomplish?

        Show
        Sean Owen added a comment - RandomWrapper most certainly does use the map. Look at the use of INSTANCES just below it in useTestSeed() which shows its purpose. It needs to reset all of the wrappers that have been made. setSeed() may be ignored, and like I've said there's an easy fix to that – new RNG object. I don't see a need to make the setSeed() method fail. I don't have a strong opinion about sharing RNGs per se – obviously correctness is vital, followed by performance, as concerns. So for example I think it's reasonable to use an RNG per iterator, yes. But at the moment I don't know that we have a determinism problem as a result of this? This patch does something different than the first patch, so I think I'm losing the plot about what this is out to accomplish?
        Hide
        Ted Dunning added a comment -

        Is this still useful? I am like Sean and have lost the thrust of the patch.

        Show
        Ted Dunning added a comment - Is this still useful? I am like Sean and have lost the thrust of the patch.
        Hide
        Sean Owen added a comment -

        OK, that is still not what the patch does though.
        I think there are two good ideas in the mix here that can be committed without controversy. First, work around setSeed() behavior by instantiating a new RNG when called. Second, don't use a shared RNG in the sampling Iterator. I suggest this is the substance of what to commit in this thread.

        Show
        Sean Owen added a comment - OK, that is still not what the patch does though. I think there are two good ideas in the mix here that can be committed without controversy. First, work around setSeed() behavior by instantiating a new RNG when called. Second, don't use a shared RNG in the sampling Iterator. I suggest this is the substance of what to commit in this thread.
        Hide
        Lance Norskog added a comment -

        Is this still useful? I am like Sean and have lost the thrust of the patch.

        I'm still working on it.

        Ted, you mentioned wanting a MurmurHash Random class. Is this what you envisioned? (It is not finished code; see below).

        public class MurmurHashRandom extends Random {
          private long murmurSeed;
          private final ByteBuffer buf;
          
          public MurmurHashRandom() {
            this(0);
          }
        
          public MurmurHashRandom(int seed) {
            SeedGenerator gen = new FastRandomSeedGenerator();
            byte[] bits = RandomUtils.longSeedtoBytes(gen.generateSeed());
            buf = ByteBuffer.wrap(bits);
            this.murmurSeed = MurmurHash.hash64A(bits, seed);
          }
          
          @Override
          public long nextLong() {
            long oldSeed = murmurSeed;
            murmurSeed = MurmurHash.hash64A(buf, (int) murmurSeed);
            return oldSeed;
          }
        
        

        It is coded against my patch, so is only here for study purposes.It's coded against the MurmurHash class in encoders.encoders.MurmurHash works in ints, not longs, so types are a bit confused in this demo code.

        Show
        Lance Norskog added a comment - Is this still useful? I am like Sean and have lost the thrust of the patch. I'm still working on it. Ted, you mentioned wanting a MurmurHash Random class. Is this what you envisioned? (It is not finished code; see below). public class MurmurHashRandom extends Random { private long murmurSeed; private final ByteBuffer buf; public MurmurHashRandom() { this (0); } public MurmurHashRandom( int seed) { SeedGenerator gen = new FastRandomSeedGenerator(); byte [] bits = RandomUtils.longSeedtoBytes(gen.generateSeed()); buf = ByteBuffer.wrap(bits); this .murmurSeed = MurmurHash.hash64A(bits, seed); } @Override public long nextLong() { long oldSeed = murmurSeed; murmurSeed = MurmurHash.hash64A(buf, ( int ) murmurSeed); return oldSeed; } It is coded against my patch, so is only here for study purposes.It's coded against the MurmurHash class in encoders.encoders.MurmurHash works in ints, not longs, so types are a bit confused in this demo code.
        Hide
        Lance Norskog added a comment -

        The Sampling iterator never drops the first sample.
        A minor nit: Iterator.hasNext() is not supposed to change anything, but this makes it hard to "keep or drop" the final sample.

        So, if you want to have a per-object random, that's great. It would be good to fix that everywhere as a sweep, but sweeps kill patches. Oh well.

        Show
        Lance Norskog added a comment - The Sampling iterator never drops the first sample. A minor nit: Iterator.hasNext() is not supposed to change anything, but this makes it hard to "keep or drop" the final sample. So, if you want to have a per-object random, that's great. It would be good to fix that everywhere as a sweep, but sweeps kill patches. Oh well.
        Hide
        Sean Owen added a comment -

        I don't know what you mean – it most definitely can drop the first element.
        What hasNext() method are you referring to?
        Yes, it's easy to make a similar change elsewhere to remove static Random instances.

        Show
        Sean Owen added a comment - I don't know what you mean – it most definitely can drop the first element. What hasNext() method are you referring to? Yes, it's easy to make a similar change elsewhere to remove static Random instances.
        Hide
        Sean Owen added a comment -

        I'm assigning to me to commit, after 0.5, changes for two aspects of this discussion:

        • RandomWrapper.setSeed() will now work, by instantiating a new RNG
        • static Random variables will be made into instance variables where possible
        Show
        Sean Owen added a comment - I'm assigning to me to commit, after 0.5, changes for two aspects of this discussion: RandomWrapper.setSeed() will now work, by instantiating a new RNG static Random variables will be made into instance variables where possible
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #848 (See https://builds.apache.org/hudson/job/Mahout-Quality/848/)

        Show
        Hudson added a comment - Integrated in Mahout-Quality #848 (See https://builds.apache.org/hudson/job/Mahout-Quality/848/ )
        Hide
        Sean Owen added a comment -

        Committed the part I referred to in comments

        Show
        Sean Owen added a comment - Committed the part I referred to in comments

          People

          • Assignee:
            Sean Owen
            Reporter:
            Lance Norskog
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Due:
              Created:
              Updated:
              Resolved:

              Development