Details

Type: Bug

Status: Closed

Priority: Minor

Resolution: Fixed

Affects Version/s: 1.0, 1.1, 1.2, 2.0, 2.1, 2.2, 3.0

Fix Version/s: 3.0

Labels:None
Description
There are some inconsistencies in the documentation and implementation of functions regarding cumulative probabilities and inverse cumulative probabilities. More precisely, '<' and '<=' are not used in a consistent way.
Besides I would move the function inverseCumulativeProbability(double) to the interface Distribution. A true inverse of the distribution function does neither exist for Distribution nor for ContinuosDistribution. Thus we need to define the inverse in terms of quantiles anyway, and this can already be done for Distribution.
On the whole I would declare the (inverse) cumulative probability functions in the basic distribution interfaces as follows:
Distribution:
 cumulativeProbability(double x): returns P(X <= x)
 cumulativeProbability(double x0, double x1): returns P(x0 < X <= x1) [see also 1)]
 inverseCumulativeProbability(double p):
returns the quantile function inf {x in R  P(X<=x) >= p}[see also 2), 3), and 4)]
1) An aternative definition could be P(x0 <= X <= x1). But this requires to put the function probability(double x) or another cumulative probability function into the interface Distribution in order be able to calculate P(x0 <= X <= x1) in AbstractDistribution.
2) This definition is stricter than the definition in ContinuousDistribution, because the definition there does not specify what to do if there are multiple x satisfying P(X<=x) = p.
3) A modification could be defined for p=0: Returning sup
would yield the infimum of the distribution's support instead of a mandatory infinity.
4) This affects issue MATH540. I'd prefere the definition from above for the following reasons:
 This definition simplifies inverse transform sampling (as mentioned in the other issue).
 It is the standard textbook definition for the quantile function.
 For integer distributions it has the advantage that the result doesn't change when switching to "x in Z", i.e. the result is independent of considering the intergers as sole set or as part of the reals.
ContinuousDistribution:
nothing to be added regarding (inverse) cumulative probability functions
IntegerDistribution:
 cumulativeProbability(int x): returns P(X <= x)
 cumulativeProbability(int x0, int x1): returns P(x0 < X <= x1) [see also 1) above]

 Math692_realDomain_patch1.patch
 26 kB
 Christian Winter

 MATH692_integerDomain_patch1.patch
 55 kB
 Christian Winter
Activity
 All
 Comments
 Work Log
 History
 Activity
 Transitions
I have neither used nor developed this part of CM, so my view on this is of but little value. Having said that, anything improving consistency can only be desirable, especially at this stage. So I'm all for it, and will be soon available (when I'm done on SYMMLQ) for an (novice on these issues) help.
Sébastien
+1
Thanks for the feedback to all. Sébastien, thanks for offering your help. If you like and find time for it, you could implement AbstractDistribution.inverseCumulativeProbability(double p).
I will provide some patches next week, but adjusting AbstractContinuousDistribution.inverseCumulativeProbability(double p) will take some more time.
After thinking a little more about the structure of the interfaces, I'd like to put the function probability(double x) to Distribution anyway (independently of the thought in point 1) above).
Are there any preferences on P(x0 <= X <= x1) or P(x0 < X <= x1) for cumulativeProbability(double x0, double x1)?
I am not sure it is really makes sense to add probability(double x) to the Distribution interface. It would have to be defined as density (referring to the distribution function) to make sense in the continuous case, since defined as p(X = x) it would in most cases be identically 0 for continuous distributions.
Regarding the cum definition, I am fine with P(x0 < X <= x1).
Happy to help on the inverse cumulative probability. You will have to be patient and forgieving with me, though, as I discover this part of CM.
As for the definition, I think that one of the bounds should be excluded, so that these cumulative probabilities can be summed
P(a < X <= c) = P(a < X <= b) + P(b < X <= c),
even in the case of discrete PDFs.
Whether the lower or upper bound should be excluded is another matter. I usually work with continuous pdfs, so I don't know if there is a common practice in the probability community. If there is none, I would tend to chose the following definition
P(x0 <= X < x1)
(sorry Phil!), because it would be consistent with the way things are usually indexed in java (a[0].. a[a.length1]). See also org.apache.commons.math.util.MultidimensionalCounter. Although this type of consistency is not an absolute requirement, I think it is nice for the user to have such simple principle: "lower bound always included, upper bound always excluded". Appart from this small point, I really have no objection to any choice.
Have a look at the default implementation of cum(x0,x1) now in AbstractDistribution. I think the incorrectness in the documentation there may have been what triggered Christian to raise this issue. The equation cum(a,b) = F(b)  F(a) where F is the distribution function is natural and what the impl there is trying to do. In the discrete case, this equation fails, however, unless you define the cum to exclude the lower endpoint. That's why P(x0 < X <= x1) is a better definition.
OK, Phil, it makes perfect sense.
Good, the definition of cum(x0,x1) will be P(x0 < X <= x1). Phil, you are right: cum(x0,x1) in AbstractDistribution was a reason for raising this issue. Another reason was cum(int x0, int x1) in AbstractIntegerDistribution.
The idea behind probability(double x) is in fact to define it as p(X = x) and to return 0 for continuous distributions. This function would be useful for discrete distributions not inheriting from IntergerDistribution and for distributions being composed of discrete and continuous parts.
I guess I am OK with pushing p up. See related post to follow in commonsdev.
Hi Christian,
I've started looking into this issue. As I said, you will have to be patient with me .
I can see there already is a default implementation of AbstractContinuousDistribution.inverseCumulativeProbability. So what exactly would you like me to do? Is this implementation fragile? Would you like me to improve robustness? Provide full testing?
I think there might be issues when the PDF falls down to zero in a range (in which case the cum exhibits a plateau). The returned value might differ from the mathematical definition you proposed. Is this what you want me to work on? Have you already identified other issues?
Best regards,
Sébastien
Hi Sébastien,
the problem with the plateau is indeed one issue which needs to be solved.
Additionally, AbstractDistribution will need an implementation of inverseCumulativeProbability. In fact both implementations should be the same except for the solver to be used. Thus inverseCumulativeProbability should be implemented just once in AbstractDistribution, and invoking the solver should be put to a separate procedure so that it can be overridden in AbstractContinuousDistribution.
A third point is the choice of the solvers. For AbstractDistribution we need a solver which works even for discontinuous cdfs (BisectionSolver can do the job, but maybe the implementations of the faster IllinoisSolver, PegasusSolver, BrentSolver, or another solver can cope with discontinuities, too). For AbstractContinuousDistribution it would be beneficial to use a DifferentiableUnivariateRealSolver. However, the NewtonSolver cannot be used due to uncertainty of convergence and an alternative doesn't seem to exist by now. So we have to choose one of the other solvers for now.
As all these points are interdependent, I guess it's best to solve them as a whole. If you like, you can do this.
Best Regards,
Christian
Another point for discussion:
I'd like to introduce
getDomainBracket(double p): returns double[]
to AbstractDistribution as helper function for inverseCumulativeProbability. This allows to avoid searching a bracket where a bracket can be specified directly.
The function getDomainBracket could be made abstract (which means to remove getInitialDomain, getDomainLowerBound, and getDomainUpperBound as these functions aren't needed any more), or it could have a default implementation (according to the corresponding part of the current implementation of inverseCumulativeProbability) which uses getInitialDomain, getDomainLowerBound, and getDomainUpperBound. However, getInitialDomain, getDomainLowerBound, and getDomainUpperBound should not be abstract in the latter case. Otherwise a derived class would be forced to implement something it potentially doesn't use. Thus the functions getInitialDomain, getDomainLowerBound, and getDomainUpperBound should have default implementations which either return default values (0, infinity, +infinity) or throw an exception saying something like "has to be implemented".
Hi Christian,
Hi Sébastien,
the problem with the plateau is indeed one issue which needs to be solved.
I'm working on it...
Additionally, AbstractDistribution will need an implementation of inverseCumulativeProbability. In fact both implementations should be the same except for the solver to be used. Thus inverseCumulativeProbability should be implemented just once in AbstractDistribution, and invoking the solver should be put to a separate procedure so that it can be overridden in AbstractContinuousDistribution.
OK, for now, I'm concentrating on making the current impl in AbstractContinuousDistribution more robust. The other impl should be easier.
A third point is the choice of the solvers. For AbstractDistribution we need a solver which works even for discontinuous cdfs (BisectionSolver can do the job, but maybe the implementations of the faster IllinoisSolver, PegasusSolver, BrentSolver, or another solver can cope with discontinuities, too). For AbstractContinuousDistribution it would be beneficial to use a DifferentiableUnivariateRealSolver. However, the NewtonSolver cannot be used due to uncertainty of convergence and an alternative doesn't seem to exist by now. So we have to choose one of the other solvers for now.
The current implementation uses a Brent solver. I think the solver itself is only one side of the issue. The other point is the algorithm used to bracket the solution, in order to ensure that the result is consistent with the definition of the cumprob. As for the DifferentiableUnivariateRealSolver, I'm not too sure. I guess it depends on what is meant by "continuous distribution". For me, it means that the random variable takes values in a continuous set, and possibly its distribution is defined by a density. However, in my view, nothing prevents occurences of Dirac functions, in which case the cum sum is only piecewise C1. It's all a matter of definition, of course, and I'll ask the question on the forum to check whether or not people want to allow for such a situation.
As all these points are interdependent, I guess it's best to solve them as a whole. If you like, you can do this.
Best Regards,
Christian
Yes, I'm very interested.
Best regards,
Sébastien
Please note that MATH699 has been created specifically to handle plateaux.
Sébastien
Here is the first patch for this issue (unfortunately with some delay). It adjusts the distributions with real domain to the definitions in this issue, and it mainly changes documentations.
I could not move inverseCumulativeProbability(double) up to Distribution because there would be a conflict with IntegerDistribution.inverseCumulativeProbability(double): This method returns int. This problem will be removed by solving issue MATH703.
The implementation of inverseCumulativeProbability(double) is not changed as Sébastien is working on this.
I will provide the patch for the integer distributions as soon as I have adjusted the test data to the new inequalities and reverified the adjusted test data.
All,
since I'm already working on this package, I'm happy to commit the patch on behalf of Christian. However, since I'm a relatively new committer, I would feel more confident if one of the "old, wise committers" could double check the svn log afterwards.
Best,
Sébastien
Hey, that's how it always works
I don't know about "wise" but I certainly qualify as "old" by any standard, so will have a look once you have reviewed and committed.
Thanks!
Patch Math692_realDomain_patch1.patch (20111108) applied in rev 1200179, with minor modifications (mostly checkstyle fixes).
Thanks Christian!
As mentioned by Sébastien in MATH699, the implementation of IntegerDistribution.inverseCumulativeProbability(double p) can benefit from the ideas which came up for RealDistribution.inverseCumulativeProbability(double p) in that thread.
Thus I will remove getDomainLowerBound(double p) and getDomainUpperBound(double p) from the integer distributions. I checked that all current implementations of the lower/upper bound methods provide the whole support of the distribution as starting bracket. This means that using getSupportLowerBound() and getSupportUpperBound() for the starting bracket won't degrade the performance of the current distribution implementations. However, a user might want the improve the performance of his distribution implementations by providing a more targeted starting bracket for probability p. Thus I will swap the solving step to a protected function solveInverseCumulativeProbability(double p, int lower, int upper), so that it gets easy to override inverseCumulativeProbability with an implementation which finds a better starting bracket.
Furthermore, Phil's idea with Chebyshev's inequality can be applied to the generic implementation of inverseCumulativeProbability in order to get a better starting bracket.
Hi Christian,
If you agree with that, I suggest that you also take care of MATH718, as the two issues seem to be very much connected.
Sébastien
Hi Sébastien,
my changes in the integer distributions don't solve MATH718. Instead I found a probably related problem with the Pascal distribution.
The integer distribution patch for this issue still isn't ready. I will provide it next week.
Christian
This is the patch which adjusts the integer distributions to the agreements above.
The changes to the test cases for the random generators may be unexpected. But these changes initially were triggered by adjusting RandomDataTest.checkNextPoissonConsistency(double) to the new contract for integer distributions. Then some random generator tests failed due to chance. While adjusting their seeds, I found some other tests with a high failure probability. Thus I also set some failure probabilities to 0.01 in order to find suitable seeds more quickly.
My next task on this issue is to adjust the user guid.
Hi Christian,
thanks for this contribution. I am away for a few days, but am very happy to commit this patch as soon as I am back, if you are not in too much of a hurry.
Thanks again,
Sébastien
Well, we've recently run into some troubles with SVN, but it seems everything is working fine again. Patch MATH692_integerDomain_patch1.patch (with minor checkstyle changes) committed in revision 1226041.
Please do not forget to run mvn clean; mvn site:site and check the reports (in particular, checkstyle) prior to submitting a patch!
Thanks for this contribution.
The committed patch actually causes failure of Well1024Test in o.a.c.m.random.
Thanks for committing the patch, Sébastien. I see you already changed the seed in Well1024aTest. This hopefully removes the failure.
I'll have a look into Maven to prepare a better patch next time.
I see you already changed the seed in Well1024aTest.
Yes I did, but is this really how we want Well2004aTest to pass?
I guess there is no alternative to this way of making probabilistic test cases pass. However, I understand your bad feeling with this kind of failure fixing. The problem is that probabilistic tests are quiet fuzzy: Neither a passed test nor a failed test provides a clear answer whether something is right or wrong in the implementation. There is just a high chance to pass such a test with a correct implementation. The chance for failure increases with an erroneous implementation due to systematic deviations in the generated data. These chances tell whether it is easy to find a seed which passes the tests or not. Thus difficulties in finding a suitable seed are an indicator for problems in the code.
Thus difficulties in finding a suitable seed are an indicator for problems in the code.
That's exactly the point I've raised on the mailinglist: out of three seeds (100, 1000 and 1001), only one works. Of course, I would not dare to call that representative statistics, but I'm wondering whether or not we should be worried...
The issue about selection of an appropriate seed has been raised elsewhere. No definitive answer has been provided so far, so I suggest we consider this issue as solved for the time being.
Thanks for raising this issue, Christian  especially now as we finalize the 3.0 API.
I am +1 for these changes. I agree that the infbased definition of inverse cum is more standard and we are in a position now make the change, so I say lets do it. I am also +1 on the move of this up to the distribution interface. The reason we did not include it there originally was that we thought we might implement distributions for which we could not define inverses. That has not happened in the last 8 years, so I think its safe enough to push it up.
The code, test, user guide and doc changes for this have to be done carefully. Patches most welcome.
Is everyone else OK with this change?