Commons Math / MATH-1220

More efficient sample() method for ZipfDistribution

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0, 3.6
    • Labels:
      None
    • Flags:
      Patch

      Description

      Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF; however, the current implementation needs O(N) power-function evaluations to compute the CDF at a given point, where N is the number of points of the Zipf distribution. I propose using rejection sampling instead, which allows a single random value to be generated in constant time.
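For readers unfamiliar with the technique, here is a minimal, self-contained sketch of rejection sampling for a Zipf target. This is not the patch's actual code: the patch uses a much tighter power-function envelope, while this toy version uses a uniform envelope and is therefore inefficient for large N.

```java
import java.util.Random;

// Toy rejection sampler for a Zipf target with pmf f(k) proportional to k^(-s).
// Proposal: uniform over {1..n}; envelope constant c = n, so the acceptance
// probability simplifies to f(k) / (c * g(k)) = k^(-s).
public class RejectionSamplingSketch {
    static int sampleZipf(int n, double s, Random rng) {
        while (true) {
            int k = 1 + rng.nextInt(n);              // propose k from the instrumental distribution
            if (rng.nextDouble() < Math.pow(k, -s)) {
                return k;                            // accept with probability k^(-s)
            }
        }
    }
}
```

The expected number of iterations of this toy loop grows with N, which is exactly the problem the patch's constant-time envelope avoids.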

      1. patch_v3
        20 kB
        Otmar Ertl
      2. Zipf Rejection Inversion Sampling Notes.pdf
        378 kB
        Otmar Ertl
      3. patch_v2
        17 kB
        Otmar Ertl
      4. patch_v1
        16 kB
        Otmar Ertl

        Activity

        Otmar Ertl added a comment -

        The patch overrides the sample() method of ZipfDistribution with an implementation that uses rejection sampling. To demonstrate the speed of the new method, the (ignored) unit test testPerformance() can be used; with the default sample() method from AbstractIntegerDistribution, the test does not finish in reasonable time. The generation of a single random value consumes at most 2 uniformly distributed random numbers on average, depending on the parameters of the Zipf distribution (see the output of testPerformance()).

        Thomas Neidhart added a comment -

        What about the method proposed in this Stack Overflow answer? http://stackoverflow.com/a/8788662/3784643

        Otmar Ertl added a comment - edited

        Do you mean using a pre-calculated CDF table and sampling via binary search? That would have O(N) initialization cost, O(N) space cost, and O(log(N)) sampling time, whereas all three are constant for the method I have proposed. If initialization and memory costs are not an issue, sampling from such a pre-calculated CDF table is probably faster for smaller N.

        The proposed rejection sampling algorithm divides the support into two parts: the head, which currently consists only of the first point (equal to 1), and the tail, which consists of all remaining points. Rejection sampling is only used to sample points from the tail. Alternatively, it would be possible to take the first X (<= N) points as the head and to use such a pre-calculated CDF table in combination with binary search for sampling points from the head; the rejection method can still be applied to the tail. When developing the algorithm I had that in mind, but finally decided to set X = 1 to minimize the initialization costs. The optimal value for X is likely use-case dependent and difficult to determine. I do not think that a pure CDF-table-based approach (X = N) is feasible for all N, because initialization and memory costs could become very large.
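The pre-calculated CDF table alternative discussed here can be sketched as follows (illustrative names, not the Commons Math API): O(N) setup and space, O(log N) per sample via binary search on the table.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of Zipf sampling from a pre-calculated CDF table with binary search.
public class ZipfCdfTableSampler {
    private final double[] cdf;   // cdf[k-1] = P(X <= k), normalized so cdf[n-1] == 1
    private final Random rng;

    ZipfCdfTableSampler(int n, double s, Random rng) {
        this.rng = rng;
        this.cdf = new double[n];
        double sum = 0;
        for (int k = 1; k <= n; k++) {
            sum += Math.pow(k, -s);   // unnormalized cumulative weights
            cdf[k - 1] = sum;
        }
        for (int i = 0; i < n; i++) {
            cdf[i] /= sum;
        }
    }

    int sample() {
        int i = Arrays.binarySearch(cdf, rng.nextDouble());
        return (i >= 0 ? i : -i - 1) + 1;   // convert insertion point to 1-based value
    }
}
```

The mixture mentioned above would use such a table only for the first X points and the rejection method for the rest.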

        Thomas Neidhart added a comment -

        Yes, I was referring to the pre-computed CDF table. I understand that for large N this will not be as efficient as your rejection sampling approach. Is the O(1) runtime complexity of your approach also guaranteed for small N? Otherwise the proposed mixture could be a good solution.

        Another thing: as far as I understand, the rejection sampling method could potentially be applied to other distributions. Do you see a way to generalize it so that it could be re-used for those too?

        Otmar Ertl added a comment -

        The average number of iterations until a random value is accepted can be bounded by a constant that is independent of the number of points and the exponent; therefore, the O(1) runtime complexity is ensured. See the output of the testPerformance() unit test, which shows that the average number of consumed uniformly distributed random numbers is bounded, and this number is directly connected to the number of iterations.

        I can imagine generalizing the rejection sampling method in its basic form. Essentially, you need to define two methods: one to generate a value from the instrumental distribution, and one to calculate the corresponding acceptance rate for that value. However, there are many rejection sampling methods that do not fit into this basic scheme (see for example the Ziggurat algorithm or the Monty Python method for generating normally distributed values). Furthermore, defining a method to calculate the acceptance rate is not feasible in all cases, because it restricts transformations of the inequality (acceptance rate < uniformly distributed random value). For example, the proposed method could also be optimized without an explicit method for the acceptance rate (a division could be replaced by a multiplication).
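The basic scheme described here could be captured by an abstract base class like the following sketch (hypothetical names, not an actual Commons Math class):

```java
import java.util.Random;

// Sketch of a generic rejection sampler defined by two abstract methods:
// one proposing a value from the instrumental distribution, and one giving
// the acceptance probability of that proposed value.
public abstract class AbstractRejectionSampler {
    protected final Random rng;

    protected AbstractRejectionSampler(Random rng) {
        this.rng = rng;
    }

    /** Draws a candidate from the instrumental distribution. */
    protected abstract int sampleInstrumental();

    /** Returns the probability in [0, 1] of accepting the candidate. */
    protected abstract double acceptanceProbability(int candidate);

    public final int sample() {
        while (true) {
            int candidate = sampleInstrumental();
            if (rng.nextDouble() < acceptanceProbability(candidate)) {
                return candidate;
            }
        }
    }
}
```

As noted above, methods like Ziggurat do not fit this scheme, and making the acceptance probability an explicit method blocks some algebraic optimizations of the acceptance test.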

        Thomas Neidhart added a comment -

        I played around with your patch, and I have two remarks on how to independently improve the Zipf distribution:

        • the call to generalizedHarmonic(numberOfElements, exponent) in the probability, logProbability and cumulativeProbability methods can be cached, as the parameters do not change
        • the inverseCumulativeProbability method could be improved in the case of power-law distributions, as it currently does a linear interpolation, which obviously results in slow convergence for such distributions
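The caching suggested in the first bullet could look roughly like this (a sketch with illustrative names; the actual ZipfDistribution fields differ):

```java
// Sketch: compute the generalized harmonic number H(N, s) lazily, once,
// and reuse it, since numberOfElements and exponent never change.
public class ZipfProbabilityCache {
    private final int numberOfElements;
    private final double exponent;
    private double generalizedHarmonic = Double.NaN;   // lazily cached H(N, s)

    ZipfProbabilityCache(int numberOfElements, double exponent) {
        this.numberOfElements = numberOfElements;
        this.exponent = exponent;
    }

    private double generalizedHarmonic() {
        if (Double.isNaN(generalizedHarmonic)) {
            double sum = 0;
            for (int k = 1; k <= numberOfElements; k++) {
                sum += Math.pow(k, -exponent);
            }
            generalizedHarmonic = sum;   // computed exactly once
        }
        return generalizedHarmonic;
    }

    double probability(int k) {
        return Math.pow(k, -exponent) / generalizedHarmonic();
    }
}
```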
        Thomas Neidhart added a comment - edited

        By the way, is the implemented rejection method published anywhere?

        The most recent paper about sampling from a Zipf distribution that I could find is available here: http://epub.wu.ac.at/1176/ ; it refers to two older methods by Devroye and Dagpunar.

        Edit: I have already done a lot of tests and research on the topic, and it seems that the proposed method improves on the current state of the art, but I could not find any reference or published version of the algorithm. Do you plan to publish it?

        Otmar Ertl added a comment -

        Caching generalizedHarmonic(numberOfElements, exponent) makes sense.

        The inverse cumulative probability could be computed more efficiently by simply summing up the probabilities of points until the searched probability is reached.

        Furthermore, I would allow the exponent to be non-negative. Currently, it is restricted to positive values.

        I have developed the method myself. I do not know whether a similar method can be found in the literature. So far, apart from this math library, I have no plans to publish it elsewhere; I am not sure I could find the time to write a paper.
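The linear-scan inverse CDF suggested above could be sketched like this (a hypothetical standalone helper, not the patch itself):

```java
// Sketch: invert the Zipf CDF by accumulating point probabilities until
// the target probability p is reached.
public class ZipfInverseCdfSketch {
    static int inverseCumulativeProbability(int n, double s, double p) {
        double norm = 0;
        for (int k = 1; k <= n; k++) {
            norm += Math.pow(k, -s);          // generalized harmonic number H(n, s)
        }
        double cumulative = 0;
        for (int k = 1; k <= n; k++) {
            cumulative += Math.pow(k, -s) / norm;
            if (p <= cumulative) {
                return k;
            }
        }
        return n;   // guard against floating-point shortfall for p near 1
    }
}
```

This needs O(N) time in the worst case but avoids the slowly converging linear interpolation of the generic solver.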

        Thomas Neidhart added a comment - edited

        Furthermore, I would allow the exponent to be non-negative. Currently, it is restricted to positive values.

        I am not sure if we should do this. In the literature and in implementations, the exponent is normally restricted even further (exp > 1), and if you look at the pmf and cdf for exponents < 1 you can see that the resulting distribution is quite distorted. I guess the typical implementation uses the Riemann zeta function, which does not converge for exponents <= 1.

        By the way, there is something in your implementation that I do not understand:

    /**
     * An instrumental distribution g(k) is used to generate random values by rejection sampling.
     * g(k) is defined as g(1) := 1 and g(k) := I(-s, k - 1/2, k + 1/2) for k larger than 1,
     * where s denotes the exponent of the Zipf distribution and I(r, a, b) is the integral of x^r for x from a to b.
     * Since 1/x^s is a convex function, Jensen's inequality gives
     * I(-s, k - 1/2, k + 1/2) >= 1/k^s for all positive k and non-negative s.
     * In order to limit the rejection rate for large exponents s,
     * the instrumental distribution weight is defined differently for the value 1.
     */
    @Override
    public int sample() {
        if (Double.isNaN(instrumentalDistributionTailWeight)) {
            instrumentalDistributionTailWeight = integratePowerFunction(-exponent, 1.5, numberOfElements + 0.5);
        }

        The comment talks about integrating the power function over the range [k - 0.5, k + 0.5], but the code in fact uses 1.5 and numberOfElements + 0.5 as bounds. Where does the 1.5 come from?

        Is it because you evaluate the tail in the range [2, N]?

        Otmar Ertl added a comment -

        Since we are sampling from a finite number of points, convergence of the infinite series is irrelevant. An exponent equal to 0 corresponds to a uniform distribution.

        Yes, the tail includes the points from 2 to N, because the first point (the head) is treated differently in order to limit the rejection rate. Otherwise, the rejection rate could become arbitrarily large for large exponents.
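For reference, the closed-form integral I(r, a, b) used for the instrumental distribution's tail weight can be written as follows (a sketch; the patch's actual helper may differ in details such as special-case handling):

```java
// Integral of x^r for x from a to b (0 < a <= b), with the r == -1 special case.
public class PowerIntegral {
    static double integratePowerFunction(double r, double a, double b) {
        if (r == -1.0) {
            return Math.log(b) - Math.log(a);   // antiderivative of 1/x is log(x)
        }
        return (Math.pow(b, r + 1) - Math.pow(a, r + 1)) / (r + 1);
    }
}
```

With the bounds from the snippet above, the tail weight is I(-s, 1.5, N + 0.5), i.e. the sum of the integrals over [k - 1/2, k + 1/2] for k = 2..N.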

        Thomas Neidhart added a comment - edited

        Committed to the 3.6 branch in commit 321269ed9aa84d15b18296ee6e73d53489efb622.

        I changed the contributed code to a static inner class and made the sampler a transient field in the distribution, initialized on first use, to avoid problems with serialization.

        Thomas Neidhart added a comment -

        Committed to the 4.0 branch in commit 002276ea313fd880122502e9840b43f996acd537.

        Thanks for the report and patch!

        Otmar Ertl added a comment -

        I have found some time to read the paper you mentioned: http://epub.wu.ac.at/1176/. The algorithm described there is superior to the method I proposed; its only drawback is that it is restricted to exponents larger than 1. However, I have found a way to transform the algorithm so that it works for any non-negative exponent (see patch_v2).

        Thomas Neidhart added a comment -

        I am not so sure that the rejection-inversion method is generally better than yours.

        Before applying the patch, I did some tests with this method too (using another implementation that I found here: http://www.nr.com/forum/showthread.php?t=1396; by the way, Go also uses this method by default: https://golang.org/src/math/rand/zipf.go), but I found that your version seems to be faster for exponents > 1.

        The rejection-inversion method draws fewer random variables, which kept me undecided, but I wanted to value your contribution.

        Otmar Ertl added a comment -

        Yes, my proposed method outperforms the rejection-inversion method for large exponents. However, I suppose large exponents (greater than 10) are not very interesting for data modelling; the exponent is usually close to 1 or even less than 1 (see for example the following paper about word frequencies: http://colala.bcs.rochester.edu/papers/piantadosi2014zipfs.pdf). The original rejection-inversion method cannot handle exponents in that range. The second patch contains an adapted rejection-inversion algorithm that covers all non-negative exponents. (I still favor including 0 as an allowed value for the exponent: it is easier for data modelling if the parameter can be chosen from a closed interval, and an exponent equal to 0 represents a meaningful distribution, since it corresponds to a uniform distribution.)
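For readers without access to the paper, the original rejection-inversion algorithm (Hörmann and Derflinger) can be sketched as follows for exponents s > 1. This is a simplified illustration only, not the adapted patch, which additionally covers exponents between 0 and 1 via transformed helper functions.

```java
import java.util.Random;

// Sketch of rejection-inversion sampling for a Zipf distribution with
// pmf proportional to k^(-s) on {1..n}, restricted here to s > 1.
public class RejectionInversionSketch {
    private final int n;
    private final double s;
    private final double uMin;   // hIntegral(1.5) - h(1): lower bound for u
    private final double uMax;   // hIntegral(n + 0.5): upper bound for u
    private final Random rng;

    RejectionInversionSketch(int n, double s, Random rng) {
        this.n = n;
        this.s = s;
        this.rng = rng;
        this.uMin = hIntegral(1.5) - h(1.0);
        this.uMax = hIntegral(n + 0.5);
    }

    private double h(double x) {
        return Math.pow(x, -s);
    }

    private double hIntegral(double x) {   // antiderivative of h, with hIntegral(1) = 0
        return (Math.pow(x, 1 - s) - 1) / (1 - s);
    }

    private double hIntegralInverse(double u) {
        return Math.pow(1 + u * (1 - s), 1 / (1 - s));
    }

    public int sample() {
        while (true) {
            double u = uMax + rng.nextDouble() * (uMin - uMax);  // uniform in (uMin, uMax]
            double x = hIntegralInverse(u);
            int k = (int) (x + 0.5);
            if (k < 1) {
                k = 1;
            } else if (k > n) {
                k = n;
            }
            // Accept when u falls inside the bar of width h(k) under the hat;
            // k == 1 is always accepted because uMin was shifted down by h(1).
            if (u >= hIntegral(k + 0.5) - h(k)) {
                return k;
            }
        }
    }
}
```

Because the hat function is inverted analytically, only one or two uniform numbers are consumed per sample on average.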

        Thomas Neidhart added a comment -

        Sorry for the delay; I am still looking at your second patch and trying to understand your changes to the original algorithm.

        Otmar Ertl added a comment -

        I attached my notes that should make it clearer how the original algorithm was transformed.

        Otmar Ertl added a comment -

        I have revised the patch that makes use of rejection-inversion sampling. I have improved documentation and inlined a proof that shows that the generated random numbers are really Zipf distributed. I think the new approach should be more comprehensible now. I would be grateful for a review.

        Otmar Ertl added a comment -

        If there are not any concerns, I will commit patch_v3.

        Thomas Neidhart added a comment -

        +1, the updated patch is certainly cleaner.

        Shaofeng SHI added a comment -

        I also found that the sample() method is very slow when numberOfElements is big (like 100,000). After changing to the new code base, it is much faster now. Thanks for the enhancement!

        Otmar Ertl added a comment -

        Introduced the rejection-inversion sampling approach with commits:

        • 484196fc2e11b632964895b6662dc783d4538b35 (3.6)
        • 9c51e5316babbd370bc32aed0fee134216726ec9 (4.0)
        Luc Maisonobe added a comment -

        Closing all resolved issues that were included in 3.6 release.


          People

          • Assignee: Otmar Ertl
          • Reporter: Otmar Ertl
          • Votes: 0
          • Watchers: 3