Commons Math
  1. Commons Math
  2. MATH-310

Supply nextSample for all distributions with inverse cdf using inverse transform sampling approach

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.2
    • Labels:
      None

      Description

      To be able to generate samples from the supported probability distributions, a generic function nextSample is implemented in AbstractContinuousDistribution and AbstractIntegerDistribution. This also gives the possibility to override the method if better algorithms are available for specific distributions as shown in the small example with the exponential distribution.

      Because the nextExponential is used several places: in nextPoisson it can be replaces by an instance if the ExponentialDistribution and in ValueServer it can as well, although maybe not in as natural maner as the other.

      This problem with the Exponential is a special problem. In general the nextSample-approaches immediately gives the possibility the sample from all the distributions with inverse cdf instead just only a couple.

      Only AbstractContinuousDistribution and AbstractIntegerDistribution extends AbstractDistribution, and both AbstractIntegerDistribution and AbstractContinuousDistribution has an inverseCumulativeProbability-function. But in AbstractContinuousDistribution the inverse cdf returns a double, and at AbstractIntegerDistribution it - naturally - returns an integer. Therefor the nextSample is not put on AbstractDistribution, but on each extension with different return types.

      RandomGenerator as parameter instead of getting a RNG inside the nextSample, because one typically wants to use the same RNG because often several random samples are wanted. Another option is to have a RNG as a field in the class, but that would be more ugly and also result in several RNGs at runtime.

      The nextPoisson etc. ought to be moved as well, if the enhancement is accepted, but it should be a quick fix.

      Tests has to be written for this change as well.

      1. patch_proposal
        5 kB
        Mikkel Meyer Andersen

        Activity

        Hide
        Ted Dunning added a comment -

        Looks nice.

        A comment on the motivation for the 0 test might be nice, but I was able to figure it out with a moment of reflection so maybe it isn't necessary.

        Show
        Ted Dunning added a comment - Looks nice. A comment on the motivation for the 0 test might be nice, but I was able to figure it out with a moment of reflection so maybe it isn't necessary.
        Hide
        Mikkel Meyer Andersen added a comment -

        Yeah, it probably would be a good idea. Especially when it's implicitly assumed that 1 is not possible to get (because of the RandomGeneratorImpl), but it's really not a contract. Maybe include 1 in the test although it's not possible with traditional RNG. And also write that the nextDouble-method in the RandomNumber generator should provide a double between 0 and 1, either in- or exclusive.

        Show
        Mikkel Meyer Andersen added a comment - Yeah, it probably would be a good idea. Especially when it's implicitly assumed that 1 is not possible to get (because of the RandomGeneratorImpl), but it's really not a contract. Maybe include 1 in the test although it's not possible with traditional RNG. And also write that the nextDouble-method in the RandomNumber generator should provide a double between 0 and 1, either in- or exclusive.
        Hide
        Phil Steitz added a comment -

        See discussion here: http://markmail.org/message/kolivuytbt5cj25s

        As stated on the mailing list, I am +0/1 on the idea of adding generic inversion-based generators that work with any invertible distribution; but I still do not see attaching them to the distribution implementations as a good idea. This is for three reasons: 0) I see it as poor separation of concerns (admittedly this is a matter of taste, but I do not see sourcing random deviates as an essential behavior of a probability distribution) 1) if the implementation is only inversion-based, it will be naive for some distributions and we do not want users to get a bad impl by default 2) to fix 1) we have to essentially refactor our package structure to place random data generation into the distributions package, causing users to have to instantiate distributions and also configure generators to get deviates. I see it as simpler and more natural to use a RandomData instance. I am -1 on dropping the random package. Therefore, I am not in favor of attaching this functionality to the distributions.

        I propose that we resolve this issue by

        1) Extend RandomDataImpl to include deviate generation from all currently implemented distributions, using generic inversion as defined below where this is the best available method, or where we have not yet implemented a better one. In 3.0, we can either eliminate the RandomData interface or extend it to include the new methods.

        2) Add a generic nextInversionDeviate(distribution, generator) method similar to that defined in the patch to RandomDataImpl. Doc the fact that for distributions having a nextXxx implementation, this should be used in place of nextInversionDeviate unless inversion is specifically desired.

        Show
        Phil Steitz added a comment - See discussion here: http://markmail.org/message/kolivuytbt5cj25s As stated on the mailing list, I am +0/1 on the idea of adding generic inversion-based generators that work with any invertible distribution; but I still do not see attaching them to the distribution implementations as a good idea. This is for three reasons: 0) I see it as poor separation of concerns (admittedly this is a matter of taste, but I do not see sourcing random deviates as an essential behavior of a probability distribution) 1) if the implementation is only inversion-based, it will be naive for some distributions and we do not want users to get a bad impl by default 2) to fix 1) we have to essentially refactor our package structure to place random data generation into the distributions package, causing users to have to instantiate distributions and also configure generators to get deviates. I see it as simpler and more natural to use a RandomData instance. I am -1 on dropping the random package. Therefore, I am not in favor of attaching this functionality to the distributions. I propose that we resolve this issue by 1) Extend RandomDataImpl to include deviate generation from all currently implemented distributions, using generic inversion as defined below where this is the best available method, or where we have not yet implemented a better one. In 3.0, we can either eliminate the RandomData interface or extend it to include the new methods. 2) Add a generic nextInversionDeviate(distribution, generator) method similar to that defined in the patch to RandomDataImpl. Doc the fact that for distributions having a nextXxx implementation, this should be used in place of nextInversionDeviate unless inversion is specifically desired.
        Hide
        Gilles added a comment - - edited

        Wouldn't something along the following lines satisfy everybody's concerns?

        In each "...Distribution" interface, add a "nextSample()" (with the appropriate return type). In each "...DistributionImpl", add a "RandomData" instance variable to be used within the "nextSample()" implementation. E.g. in "PoissonDistributionImpl.java"

        PoissonDistributionImpl.java
        import org.apache.commons.math.random.RandomData;
        import org.apache.commons.math.random.RandomDataImpl;
        
        public class PoissonDistributionImpl extends ... {
            private final RandomData randomData = new RandomDataImpl();
            // ....
            long nextSample() {
                randomData.nextPoisson();
            }   
        }
        
        Show
        Gilles added a comment - - edited Wouldn't something along the following lines satisfy everybody's concerns? In each "...Distribution" interface, add a "nextSample()" (with the appropriate return type). In each "...DistributionImpl", add a "RandomData" instance variable to be used within the "nextSample()" implementation. E.g. in "PoissonDistributionImpl.java" PoissonDistributionImpl.java import org.apache.commons.math.random.RandomData; import org.apache.commons.math.random.RandomDataImpl; public class PoissonDistributionImpl extends ... { private final RandomData randomData = new RandomDataImpl(); // .... long nextSample() { randomData.nextPoisson(); } }
        Hide
        Phil Steitz added a comment -

        The problem is that to be really useful, this requires that you expose the RandomData instance and allow the RandomGenerator to be configured. This contributes to the aroma of bad separation of concerns, IMO. To make it a little clearer why I see this as bad separation of concerns, consider that you could accomplish the same thing by putting the impl in each of the distributions and then having RandomDataImpl use the impls in the distributions. That seems backwards to me.

        Show
        Phil Steitz added a comment - The problem is that to be really useful, this requires that you expose the RandomData instance and allow the RandomGenerator to be configured. This contributes to the aroma of bad separation of concerns, IMO. To make it a little clearer why I see this as bad separation of concerns, consider that you could accomplish the same thing by putting the impl in each of the distributions and then having RandomDataImpl use the impls in the distributions. That seems backwards to me.
        Hide
        Gilles added a comment -

        From what I've read in the ML thread, people that want to have "nextSample()" don't really care about configuring the RNG, so you don't have to expose it to provide that syntactic sugar.
        Those who would like more flexibility (like configuring the RNG) should use "RandomData" explicitely.
        And the concern that all implementations dealing with sampling stay in package "random" is fulfilled.

        Show
        Gilles added a comment - From what I've read in the ML thread, people that want to have "nextSample()" don't really care about configuring the RNG, so you don't have to expose it to provide that syntactic sugar. Those who would like more flexibility (like configuring the RNG) should use "RandomData" explicitely. And the concern that all implementations dealing with sampling stay in package "random" is fulfilled.
        Hide
        Phil Steitz added a comment -

        Not being able to reseed the PRNG is a pretty serious limitation when generating random data. Having to maintain and document random data generation in two places is also not something that I personally look forward to. Making the distribution classes (in my opinion needlessly) stateful and forcing them all to initialize and drag along a RandomData instance is also not a good idea, IMO.

        Show
        Phil Steitz added a comment - Not being able to reseed the PRNG is a pretty serious limitation when generating random data. Having to maintain and document random data generation in two places is also not something that I personally look forward to. Making the distribution classes (in my opinion needlessly) stateful and forcing them all to initialize and drag along a RandomData instance is also not a good idea, IMO.
        Hide
        Gilles added a comment - - edited

        > Not being able to reseed the PRNG is a pretty serious limitation when generating random data.

        As I said, for those who need rich features, they can use them with your preferred option, namely through directly instantiating an object of type "RandomData".

        > Having to maintain and document random data generation in two places [...]

        This just reduces to e.g.

        PoissonDistributionImpl.java
        public class PoissonDistributionImpl extends ... {
            /**
              * Provide basic sampling functionality. This method directly calls the
              * the {@link RandomData#nextPoisson() nextPoisson} method in {@link RandomData}.
              * For more control, please use that class directly.
              * @see {@link RandomDataImpl}.
              */
            long nextSample() { randomData.nextPoisson(); }
        }
        

        > Making the distribution classes [...] stateful [...]

        This is really an "implementation detail" (as it is completely hidden), that is, if you accept that users cannot access the RNG machinery from this side.

        Show
        Gilles added a comment - - edited > Not being able to reseed the PRNG is a pretty serious limitation when generating random data. As I said, for those who need rich features, they can use them with your preferred option, namely through directly instantiating an object of type "RandomData". > Having to maintain and document random data generation in two places [...] This just reduces to e.g. PoissonDistributionImpl.java public class PoissonDistributionImpl extends ... { /** * Provide basic sampling functionality. This method directly calls the * the {@link RandomData#nextPoisson() nextPoisson} method in {@link RandomData}. * For more control, please use that class directly. * @see {@link RandomDataImpl}. */ long nextSample() { randomData.nextPoisson(); } } > Making the distribution classes [...] stateful [...] This is really an "implementation detail" (as it is completely hidden), that is, if you accept that users cannot access the RNG machinery from this side.
        Hide
        Phil Steitz added a comment -

        I guess I am OK with this, as long as RandomDataImpl retains its strategy of lazy initialization of SecureRandom and Random instances - so what is "dragged along" is not a big memory footprint or initialization overhead. This creates a dependency shared by all distributions, but if we are willing to manage it, then OK.

        Show
        Phil Steitz added a comment - I guess I am OK with this, as long as RandomDataImpl retains its strategy of lazy initialization of SecureRandom and Random instances - so what is "dragged along" is not a big memory footprint or initialization overhead. This creates a dependency shared by all distributions, but if we are willing to manage it, then OK.
        Hide
        Gilles added a comment -

        Shall we implement this feature in 2.1?
        As it is an addition, there isn't a high risk that it'll break existing code. [The only potential problem is if someone created his own implementation of one of the "...Distribution" interfaces. How likely is that?]

        Show
        Gilles added a comment - Shall we implement this feature in 2.1? As it is an addition, there isn't a high risk that it'll break existing code. [The only potential problem is if someone created his own implementation of one of the "...Distribution" interfaces. How likely is that?]
        Hide
        Phil Steitz added a comment -

        Fix completed in r949895.

        Show
        Phil Steitz added a comment - Fix completed in r949895.
        Hide
        Luc Maisonobe added a comment -

        Closing issue as it was included in version 2.2, which has been released

        Show
        Luc Maisonobe added a comment - Closing issue as it was included in version 2.2, which has been released

          People

          • Assignee:
            Unassigned
            Reporter:
            Mikkel Meyer Andersen
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 3h
              3h
              Remaining:
              Remaining Estimate - 3h
              3h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development