Commons Math / MATH-1153

Sampling from a 'BetaDistribution' is slow

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0, 3.6
    • Labels:
      None

      Description

      Currently, `BetaDistribution#sample` uses the inverse CDF method, which is quite slow for sampling-intensive computations. I've implemented a method from the R. C. H. Cheng paper, and it seems to work much better. Here's a simple microbenchmark:

      o.j.b.s.SamplingBenchmark.algorithmBCorBB       1e-3    1000  thrpt        5  2592200.015    14391.520  ops/s
      o.j.b.s.SamplingBenchmark.algorithmBCorBB       1000    1000  thrpt        5  3210800.292    33330.791  ops/s
      o.j.b.s.SamplingBenchmark.commonsVersion        1e-3    1000  thrpt        5    31034.225      438.273  ops/s
      o.j.b.s.SamplingBenchmark.commonsVersion        1000    1000  thrpt        5    21834.010      433.324  ops/s
      

      Should I submit a patch?

      R. C. H. Cheng (1978). Generating beta variates with nonintegral shape parameters. Communications of the ACM, 21, 317–322.
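      For context on why inversion is slow here: the default sampler draws one uniform deviate and numerically inverts the CDF, and the Beta CDF has no closed-form inverse, so every sample costs a root-finding loop over the regularized incomplete beta function. The following is a minimal self-contained sketch of generic inverse-transform sampling by bisection, illustrated with the exponential distribution (whose inverse CDF is known, so the result can be checked); the class and method names are made up for illustration, and Commons Math's actual solver is more sophisticated than plain bisection:

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

public class InverseTransformSketch {
    // Invert a monotone CDF by bisection on [lo, hi]; roughly what a generic
    // inverseCumulativeProbability-based sampler must do per sample when no
    // closed-form inverse exists (as for the Beta CDF).
    static double inverseCdf(DoubleUnaryOperator cdf, double u, double lo, double hi) {
        for (int i = 0; i < 60; i++) {           // ~60 halvings exhaust double precision
            double mid = 0.5 * (lo + hi);
            if (cdf.applyAsDouble(mid) < u) lo = mid; else hi = mid;
        }
        return 0.5 * (lo + hi);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        DoubleUnaryOperator expCdf = x -> 1.0 - Math.exp(-x);  // Exp(1) CDF
        double u = rng.nextDouble();
        double byBisection = inverseCdf(expCdf, u, 0.0, 50.0);
        double closedForm = -Math.log(1.0 - u);                // exact inverse for Exp(1)
        System.out.println(Math.abs(byBisection - closedForm) < 1e-9);  // prints true
    }
}
```

      Each sample costs dozens of CDF evaluations, which is the gap the benchmark above measures.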

      Attachments

      1. ChengBetaSampler.java
        3 kB
        Sergei Lebedev
      2. ChengBetaSampler.java
        3 kB
        Sergei Lebedev
      3. ChengBetaSamplerTest.java
        3 kB
        Sergei Lebedev

        Activity

        tn Thomas Neidhart added a comment -

        Hi Sergei,

        yes, please attach the patch so we can evaluate it, the performance improvements are really significant.

        psteitz Phil Steitz added a comment -

        Definitely looks like a big performance improvement. Let's make sure to run G- or chi-square tests to verify the results. Is the paper available online somewhere? Thanks for the patch!

        lebedev Sergei Lebedev added a comment -

        Is it alright if I submit a patch via a PR to the GitHub mirror?

        luc Luc Maisonobe added a comment -

        There is a problem with the GitHub mirror currently, as it does not follow our new Git repository, which is http://git-wip-us.apache.org/repos/asf/commons-math.git

        lebedev Sergei Lebedev added a comment -

        > Is the paper available online somewhere?
        I've downloaded the paper via SciHub.

        > Lets make sure to run G- or ChiSquare tests to verify results.
        I've done G- and KS-tests at level 0.01 on a sample of size 1000. Both fail to reject the null, which either means that I was lucky or that the method works more often than not.

        Attached are the sources of ChengBetaSampler and ChengBetaSamplerTest. Looking forward to your comments and suggestions.

        luc Luc Maisonobe added a comment -

        I have reworked the patch in order to integrate it directly into the BetaDistribution class.
        This involved a few variable renames, and I am not at all sure I did it properly.

        It seems the goodness-of-fit test failed in many cases. I finally selected some seeds for which it succeeded, just to make some progress here. This is clearly not satisfying. However, the patch has a side effect of making most random generator tests fail, as they share a common test based on beta distribution sampling. They all inherit this test from RandomDataGeneratorTest; the test is testNextInversionDeviate.

        So my current state of mind is that I broke something while updating the patch, but I do not have the necessary skills to analyze it and even less to fix it.

        Could someone look at my attempts? They are available on a MATH-1153 branch in the git repository.

        In the meantime, I propose to postpone this issue after 3.4.

        tn Thomas Neidhart added a comment -

        The same failures appear when using the original patch, so there was no error during the integration / renaming of variables.

        It looks like the test failure (RandomDataGeneratorTest.testNextInversionDeviate) is due to changes in how the random generator is used during sampling. Thus the assumptions made in the test no longer hold.

        tn Thomas Neidhart added a comment -

        Btw., if the two methods algorithmBB and algorithmBC are made static, with an additional parameter for the random generator to be used, you could avoid all the variable renamings and the code would be cleaner imho.

        psteitz Phil Steitz added a comment (edited)

        The RandomDataGenerator test failure is due to the fact that, after the patch, the Beta distribution sample() method no longer uses the default inversion-based method provided by AbstractRealDistribution. Either the test should be dropped, or it should be modified to use a distribution that uses inversion-based sampling, which, to prevent this from happening again, should be a concrete distribution class defined just for this test. I think the test should be dropped: it was introduced before sampling moved to the distributions and is really a test of inversion-based sampling. If we really want to keep a test of the inversion algorithm, it should probably move to AbstractRealDistributionTest. I would be +1 for just dropping it.

        The homogeneity test failures are more problematic, as the need to be careful with the PRNG seeds may indicate that the generated values do not follow the target distribution. We should investigate using different parameter values and sample sizes, and also add some chi-square tests using the TestUtils method with a larger number of bins than the superclass test uses. If these fail, we might be able to see systematic bias in the reported bin failures.
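        For illustration, the binned chi-square goodness-of-fit statistic can be sketched without the TestUtils API (the class and method names here are hypothetical, and this computes only the statistic, not the p-value):

```java
public class ChiSquareSketch {
    // Chi-square statistic for samples in [0, 1) against equal-probability bins:
    // sum over bins of (observed - expected)^2 / expected.
    static double chiSquareStatistic(double[] samples, int bins) {
        long[] observed = new long[bins];
        for (double x : samples) {
            observed[Math.min((int) (x * bins), bins - 1)]++;
        }
        double expected = (double) samples.length / bins;
        double stat = 0.0;
        for (long o : observed) {
            stat += (o - expected) * (o - expected) / expected;
        }
        return stat;
    }

    public static void main(String[] args) {
        int n = 10000, bins = 10;
        double[] samples = new double[n];
        // A perfectly stratified sample puts exactly n / bins points in each bin,
        // so the statistic is 0. A real sampler's output would instead be compared
        // against the chi-square critical value for (bins - 1) degrees of freedom,
        // which is ~16.92 at the 0.05 level for 9 df.
        for (int i = 0; i < n; i++) samples[i] = (i + 0.5) / n;
        System.out.println(chiSquareStatistic(samples, bins));  // prints 0.0
    }
}
```

        With many bins, the per-bin (observed - expected)^2 / expected contributions can be inspected individually, which is what would expose a systematic bias in particular regions of [0, 1].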

        luc Luc Maisonobe added a comment -

        In fact, I started incorporating the patch as an internal static class with static methods, and passed the generator, alpha and beta to these methods. I thought it would appear overengineered, but it seems to be a good approach after all. So I'll put it back this way, so it is closer to the original patch.

        lebedev Sergei Lebedev added a comment (edited)

        I think the KS-test failures might be related to the incorrect p-value calculation in the presence of ties (reported in MATH-1197). Ties are common for the Beta distribution with extreme parameter values, e.g.

        > sum(rbeta(1024, 100, 0.01) == 1)
        [1] 738
        

        I've attached a minor improvement to the original ChengBetaSampler which uses logs where appropriate.

        tn Thomas Neidhart added a comment (edited)

        After fixing the KS inference tests, the respective test failures disappeared, as expected.

        The remaining test failure in testNextInversionDeviate is because the Cheng sampler uses a kind of rejection sampling and will consume more randomness from the provided RandomGenerator.

        This is a recurring issue, as there are improved sampling methods for other distributions as well that consume more randomness (see MATH-1220 for the Zipf distribution).

        This also relates to MATH-1158, as it proposes a different way to create a sampler for a distribution. This would probably also allow providing different samplers through a common interface: e.g. the default one uses the inverse transform method, while more optimized ones could be available which require different assumptions, e.g. wrt the RandomGenerator.
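        The "consumes more randomness" point can be made concrete with a toy rejection sampler. The sketch below (hypothetical names; Cheng's actual algorithms BB/BC are more elaborate) samples Beta(2, 2), whose density 6x(1-x) is bounded by 1.5, and counts uniform draws: each attempt consumes two uniforms and rejected attempts repeat, so the number of draws per returned sample is variable and averages more than one, which is exactly what breaks a test assuming one draw per sample() call:

```java
import java.util.Random;

public class RejectionCountSketch {
    static int draws = 0;  // counts uniforms consumed, to show variable usage

    // Accept-reject sampler for Beta(2, 2): propose x ~ U(0, 1), accept with
    // probability f(x) / M where f(x) = 6 x (1 - x) and M = 1.5 bounds f.
    static double sampleBeta22(Random rng) {
        while (true) {
            double x = rng.nextDouble();
            double u = rng.nextDouble();
            draws += 2;
            if (u * 1.5 <= 6.0 * x * (1.0 - x)) {
                return x;
            }
        }
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        int n = 1000;
        double sum = 0.0;
        for (int i = 0; i < n; i++) sum += sampleBeta22(rng);
        // Acceptance probability is 1 / M = 2/3, so on average 3 uniforms
        // are consumed per sample rather than the 1 the old test assumed.
        System.out.println("mean = " + (sum / n) + ", uniforms consumed = " + draws);
    }
}
```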

        psteitz Phil Steitz added a comment -

        See my comment above on the testNextInversionDeviate failure. I remain +1 for just dropping the test, as it is just a test of the default inversion-based sampling. There is no reason to expect that it will succeed for non-inversion based samplers. If there are no objections, I will drop it. If we really want to retain it, we should create a dummy distribution that will always use inversion-based sampling and replace the Beta instance with that (or come up with a better test somehow).

        I am not sure I understand your comments, Thomas, about "consuming more randomness" for other distributions. What tests, exactly, would fail, and why?

        The failing test is from the RandomDataGenerator and it is designed just to test the default inversion-based sampler that distributions that do not supply a custom sampler use. It was implemented (stupidly, in retrospect) using a Beta distribution when at the time that distribution did not override the default. Given the current code structure (with samplers moved to the dist package) it would probably also be better to move this test to AbstractRealDistributionTest if we decide to rectify and retain it.

        tn Thomas Neidhart added a comment -

        I was probably not very clear in my comment; I just wanted to say that different sampling methods use more than one random sample per call to RealDistribution.sample(), while the test assumes exactly one.

        I am also fine with removing the test.

        tn Thomas Neidhart added a comment -

        Looking at the Wikipedia page, I found another way to sample from a beta distribution:

          sample X from a gamma distribution with parameters (alpha, 1)
          sample Y from a gamma distribution with parameters (beta, 1)
          return X / (X + Y)
        

        I quickly tested this method; it is even faster than the Cheng method and passes the tests.
        The same method is also implemented in Mahout (see https://github.com/apache/mahout/blob/master/mr/src/main/java/org/apache/mahout/clustering/UncommonDistributions.java).
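        A self-contained sketch of this gamma-ratio approach is below. The gamma sampler used here is Marsaglia-Tsang (2000), chosen only to make the example runnable; it is not necessarily the algorithm Commons Math or Mahout use, and the class and method names are made up for illustration:

```java
import java.util.Random;

public class GammaRatioBetaSampler {
    // Marsaglia-Tsang (2000) sampler for Gamma(shape, 1), shape > 0.
    static double sampleGamma(Random rng, double shape) {
        if (shape < 1.0) {
            // Boost small shapes: Gamma(a) has the law of Gamma(a + 1) * U^(1/a).
            return sampleGamma(rng, shape + 1.0) * Math.pow(rng.nextDouble(), 1.0 / shape);
        }
        double d = shape - 1.0 / 3.0;
        double c = 1.0 / Math.sqrt(9.0 * d);
        while (true) {
            double x = rng.nextGaussian();
            double v = 1.0 + c * x;
            if (v <= 0.0) continue;
            v = v * v * v;
            double u = rng.nextDouble();
            // Fast squeeze check first, exact log check as fallback.
            if (u < 1.0 - 0.0331 * x * x * x * x
                    || Math.log(u) < 0.5 * x * x + d * (1.0 - v + Math.log(v))) {
                return d * v;
            }
        }
    }

    // Beta(alpha, beta) via the gamma ratio X / (X + Y) described above.
    static double sampleBeta(Random rng, double alpha, double beta) {
        double x = sampleGamma(rng, alpha);
        double y = sampleGamma(rng, beta);
        return x / (x + y);
    }

    public static void main(String[] args) {
        Random rng = new Random(0);
        int n = 100000;
        double sum = 0.0;
        for (int i = 0; i < n; i++) sum += sampleBeta(rng, 2.0, 5.0);
        System.out.println("empirical mean = " + sum / n);  // should be near 2/7
    }
}
```

        Note that, like the Cheng sampler, both the gamma rejection loop and the small-shape boost consume a variable number of random draws per sample, so the testNextInversionDeviate assumption fails for this method too.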

        lebedev Sergei Lebedev added a comment -

        Interesting, can you provide benchmark results for different values of alpha and beta?

        tn Thomas Neidhart added a comment -

        I did not write a JMH benchmark yet, but it looks like for alpha/beta values < 1 the variant that samples from a gamma distribution is faster, while for values > 1 the Cheng sampling method is slightly faster.

        The difference is small, though, and probably negligible once the code has been JIT-compiled.

        tn Thomas Neidhart added a comment -

        Committed patch to 4.0 in commit 5597ed7ea300ae3d08cd893b0133bce26038a7df.

        tn Thomas Neidhart added a comment -

        Committed to 3.6 in commit f5d028ca6af5591ca51785da7c15d7bd81d4215f.

        tn Thomas Neidhart added a comment -

        Thanks for the patch and your patience!

        luc Luc Maisonobe added a comment -

        Closing all resolved issues that were included in 3.6 release.


  People

  • Assignee: Unassigned
  • Reporter: lebedev Sergei Lebedev
  • Votes: 0
  • Watchers: 3
