Details
Description
Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time.

 Zipf Rejection Inversion Sampling Notes.pdf
 378 kB
 Otmar Ertl

 patch_v3
 20 kB
 Otmar Ertl

 patch_v2
 17 kB
 Otmar Ertl

 patch_v1
 16 kB
 Otmar Ertl
Activity
 All
 Comments
 Work Log
 History
 Activity
 Transitions
what about the method proposed in this Stackoverflow answer: http://stackoverflow.com/a/8788662/3784643
Do you mean using a precalculated CDF table and sampling using binary search? This would have O(N) initialization costs, O(N) space costs, and sampling would require O(log(N)) time. All that is constant for the method I have proposed. If initialization and memory costs are not an issue, sampling using such a precalculated table for the CDF is probably faster for smaller N.
The proposed rejection sampling algorithm divides the support into two parts. The head which currently only consists of the first point (equal to 1) and the tail which consists of all remaining points. Rejection sampling is only used to sample points from the tail. Instead, it would be possible to take the first X (<=N) points as the head and to use such a precalculated CDF table in combination with binary serach for sampling points from the head. For the tail still the rejection method can be applied. When developing that algorithm I had that in mind, but finally decided to set X = 1 to minimize the initialization costs. The optimal value for X is likely to be use case dependent and difficult to determine. I do not think that a pure CDF table based approach (X=N) is feasible for all N, because initialization and memory costs could get very large.
Yes, I was referring to the precomputated CDF table. I understand that for large N this will not be as efficient as your rejection sampling approach. Is the O(1) runtime complexity of your approach also guaranteed for small N? Otherwise the proposed mixture could be a good solution.
Another thing: as far as I understand, the rejection sampling method could be potentially applied to different distributions. Do you see a way to generalize it so that it could be reused for others too?
The avg. number of iterations until a random value is accepted can be bound by a constant that is independent of the number of points and the exponent. Therefore, the O(1) runtime complexity is ensured. See the output of the testPerformance() unit test, which shows that the avg. number of consumed uniformly distributed random numbers is limited, which is directly connected to the number of iterations.
I can imagine to generalize the rejection sampling method in its basic form. Basically, you need to define two methods, a method to generate a value from the instrumental distribution and a method to calculate the corresponding acceptance rate for that value. However, there are many rejection sampling methods that do not fit into this basic scheme (see for example the Ziggurat algorithm or the Monty Python method to generate normal distributed values). Furthermore, defining a method to calculate the acceptance rate is not feasible in all cases, because it restricts transformations of the inequality (acceptance rate < uniformly distributed random value). For example, the proposed method could also be optimized without the definition of an explicit method for the acceptance rate (a division could be replaced in favor of a mutliplication).
I played around with your patch, and I have two remarks to independently improve the Zipf distribution:
 the call to generalizedHarmonic(numberOfElements, exponent) in the probability and logProbability and cumulativeProbabilty methods can be cached as the parameters do not change
 the inverseCumulativeProbability method could be improved in case of powerlaw distributions as currently it does a linear interpolation which will obviously result in slow convergence for such distributions
btw. is the implemented rejection method published anywhere?
The most recent paper about sampling from a Zipf distribution that I could find is available here: http://epub.wu.ac.at/1176/ which refers to two older methods from Devroye and Dagpunar.
Edit: I did a lot of tests and research already on the topic, and it seems that the proposed method improves the current state of the art, but I could not find any reference or published version of the algorithm. Do you plan to publish it?
Caching generalizedHarmonic(numberOfElements, exponent) makes sense.
The inverse cumulative probability would be more efficient by simply summing up the probabilities of points until the searched probability is met.
Furthermore, I would allow the exponent to be nonnegative. Currently, it is restricted to positive values.
I have developed the method by myself. I do not know if a similar method can be found in literature. So far, apart from this math library, I have no plans to publish it somewhere else. I am not sure, if I could bring up the time to write some paper.
Furthermore, I would allow the exponent to be nonnegative. Currently, it is restricted to positive values.
I am not sure if we should do this. Normally in literature and implementations, the exponent is even further restricted (exp > 1), and if you look at the pmf and cdf functions for exponents < 1 you can see that the resulting distribution is quite distorted. I guess the typical implementation uses the Riemann zeta function which does not converge for exponents < 1.
btw. there is something in your implementation that I do not understand:
* An instrumental distribution g(k) is used to generate random values by rejection sampling. * g(k) is defined as g(1):= 1 and g(k) := I(s,k1/2,k+1/2) for k larger than 1, * where s denotes the exponent of the Zipf distribution and I(r,a,b) is the integral of x^r for x from a to b. * Since 1^x^s is a convex function, Jensens's inequality gives * I(s,k1/2,k+1/2) >= 1/k^s for all positive k and nonnegative s. * In order to limit the rejection rate for large exponents s, * the instrumental distribution weight is differently defined for value 1. */ @Override public int sample() { if (Double.isNaN(instrumentalDistributionTailWeight)) { instrumentalDistributionTailWeight = integratePowerFunction(exponent, 1.5, numberOfElements+0.5); }
talks about integrating the power function in the range [k0.5, k+0.5] but in fact uses 1.5 and k+0.5 as bounds. Where does the 1.5 come from?
Is it because you evaluate the tail in the range [2, N]?
Since we are sampling from a finite number of points, convergence of the infinite series is irrelevant. Exponent equal to 0 corresponds to a uniform distribution.
Yes, the tail includes the points starting from 2 to N, because the first point (the head) is treated differently in order to limit the rejection rate. Otherwise, the rejection rate could become arbitrarily large for large exponents.
Committed to 3.6 branch in commit 321269ed9aa84d15b18296ee6e73d53489efb622.
Changed the contributed code to a static inner class and make the sampler a transient field in the distribution which is initialized at first use to avoid problems wrt serialization.
Committed to 4.0 branch in commit 002276ea313fd880122502e9840b43f996acd537.
Thanks for the report and patch!
I have found some time to read the paper you mentioned: http://epub.wu.ac.at/1176/. The algorithm described there is superior to the method I have proposed. The only drawback is that it is restricted to exponents larger than 1. However, I have found a way to transform the algorithm so that it should work for any nonnegative exponents (see patch_v2).
I am not so sure that rejection inversion method is generally better than yours.
Before applying the patch, I did some tests with this method too (using another implementation that I found here: http://www.nr.com/forum/showthread.php?t=1396, btw go also uses this method by default: https://golang.org/src/math/rand/zipf.go), but I found that your version seems to be faster for exponents > 1.
The rejection inversion method draws less random variables which kept me undecided, but I wanted to value your contribution.
Yes, my proposed method outperforms the rejection inversion method for large exponents. However, I suppose large exponents (greater than 10) are not very interesting for data modelling. The exponent is usually close to 1 or even less than 1. See for example following paper about word frequencies http://colala.bcs.rochester.edu/papers/piantadosi2014zipfs.pdf. The original rejection inversion method cannot handle exponents in that range. The second patch contains an adapted rejection inversion algorithm, that covers all nonnegative exponents. (I still favor to include 0 as allowed value for the exponent. It is easier for data modelling, if you are allowed to choose the parameter from an closed interval. Furthermore, exponent equal to 0 represents a meaningful distribution, since it corresponds to a uniform distribution.)
sorry for the delay, I am still looking at your second patch and trying to understand your changes to the original algorithm.
I attached my notes that should make it clearer how the original algorithm was transformed.
I have revised the patch that makes use of rejectioninversion sampling. I have improved documentation and inlined a proof that shows that the generated random numbers are really Zipf distributed. I think the new approach should be more comprehensible now. I would be grateful for a review.
If there are not any concerns, I will commit patch_v3^{}.
+1, the updated patch is certainly cleaner.
I also found the sample() method is very slow when numberOfElements is big (like 100,000); After changing to the new code base, it is much much fast now; Thanks for the enhancement!
Introduced rejectioninversion sampling approach with commits
 484196fc2e11b632964895b6662dc783d4538b35 (3.6)
 9c51e5316babbd370bc32aed0fee134216726ec9 (4.0)
Closing all resolved issues that were included in 3.6 release.
The patch overrides the sample() method for ZipfDistribution by an implementation that uses rejection sampling. To demonstrate the speed of the new method the (ignored) unit test testPerformance() can be used. Using the default sample() method from AbstractIntegerDistribution instead, the test will not finish in reasonable time. The generation of a single random value consumes at most 2 uniformly distributed random numbers on average, dependent on the parameters of the Zipf distribution (see the output of testPerformance()).