Commons Math
  1. Commons Math
  2. MATH-100

[Math] HypergeometricDistributionImpl cumulativeProbability calculation overflown

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Environment:

      Operating System: Windows XP
      Platform: PC

      Description

      Hi, for all that might use this class:

      several things I found when using this class to calculate the
      cumulative probability. I attached my code FYI. three things:

      1. when I used my code to calculate the cumulativeProbability(50) of
      5000 200 100 (Population size, number of successes, sample size),
      result was greater than 1 (1.0000000000134985);
      2. when I calculated cumulativeProbability(50) and
      cumulativeProbability(51) for the distribution 5000 200 100, I got the
      same results, but it should have been different;
      2. the cumulativeProbability returns "for this distribution, X,
      P(X<=x)", but most of the time (at least in my case) what I care about
      is the upper tail (X>=x). based on the above findings, I can't simply
      use 1-cumulativeProbability(x-1) to get what I want.

      here's what I think might be related to the problem: since the
      cumulativeProbability is calculating the lower tail (X<=x), a
      distribution like above often has this probability very close to 1;
      thus it's difficult to record a number that = 1-1E-50 'cause all you
      can do is record sth like 0.9999..... and further digits will be
      rounded. to avoid this, I suggest adding a new function to calculate
      upper tail or change this to calculate x in a range like (n<=x<=m), in
      addition to fix the overflow of the current function.

      thank you for your patience to get here. I'm a newbie but I've asked
      Java experts in our lab about this. looking into the source code really
      isn't up for me......hope someone can fix it, BTW I'm using cygwin under
      WinXP pro SP2, with Java SDK 1.4.2_09 build b05, and the commons-math I used is
      both the 1.0 and the nightly build of 8-15-05.

      the code:
      -------------------
      import org.apache.commons.math.distribution.HypergeometricDistributionImpl;

      class HyperGeometricProbability {

      public static void main(String args[]) {

      if(args.length != 4)

      { System.out.println("USAGE: java HyperGeometricProbabilityCalc [population] [numsuccess] [sample] [overlap]"); }

      else {

      String population = args[0];
      String numsuccess = args[1];
      String sample = args[2];
      String overlap = args[3];

      int populationI = Integer.parseInt(population);
      int numsuccessI = Integer.parseInt(numsuccess);
      int sampleI = Integer.parseInt(sample);
      int overlapI = Integer.parseInt(overlap);

      HypergeometricDistributionImpl hDist = new
      HypergeometricDistributionImpl(populationI, numsuccessI, sampleI);

      double raw_probability = 1.0;
      double cumPro = 1.0;
      double real_cumPro = 1.0;

      try {
      if (0 < overlapI && 0 < numsuccessI && 0 < sampleI)

      { raw_probability = hDist.probability(overlapI); cumPro = hDist.cumulativeProbability(overlapI - 1.0); real_cumPro = 1.0 - cumPro; System.out.println("cumulative probability=" + cumPro + "\t" + "raw probability=" + raw_probability + "\t" + "real cumulative probability=" + "\t" + real_cumPro); }

      }
      catch (Exception e)

      { System.err.println("BAD PROBABILITY CALCULATION"); }

      }

      }

      }
      ----------------------------------

        Activity

        Hide
        mike hu added a comment -

        I tested the nightly build 8-29-05. it seems like the bugs I mentioned above are
        all fixed. will post if I get further problems.

        Thanks a lot!

        Show
        mike hu added a comment - I tested the nightly build 8-29-05. it seems like the bugs I mentioned above are all fixed. will post if I get further problems. Thanks a lot!
        Hide
        mike hu added a comment -

        I saw your post and actually bugzilla told me
        Mid-air collision detected!
        Someone else has made changes to this bug at the same time you were trying to.
        The changes made were:
        pretty cool huh?

        anyways, it seems like that for some cases your are using 1-small probabilities,
        while in the example I mentioned it's not calculated this way: why there seems
        to be two ways to calculate this? is it for performance concern? personally I'd
        rather get the accurate answer than saving time, and I'm currently doing the
        slow looping thing to get the right answer.

        thanks!

        Show
        mike hu added a comment - I saw your post and actually bugzilla told me Mid-air collision detected! Someone else has made changes to this bug at the same time you were trying to. The changes made were: pretty cool huh? anyways, it seems like that for some cases your are using 1-small probabilities, while in the example I mentioned it's not calculated this way: why there seems to be two ways to calculate this? is it for performance concern? personally I'd rather get the accurate answer than saving time, and I'm currently doing the slow looping thing to get the right answer. thanks!
        Hide
        mike hu added a comment -

        OKay, for the love of Java, let's see the details of all the bugs. I tend to
        believe they're not all caught yet, but if I have the jar I can definitely test
        them out.

        for the current build 8-25-05, the code below
        HypergeometricDistributionImpl hDist = new HypergeometricDistributionImpl(26932,
        270, 823);

        double probability = hDist.probability(52);
        double utprobability = hDist.upperCumulativeProbability(52);
        double cprobability = hDist.cumulativeProbability(52);

        System.out.println(probability);
        System.out.println(utprobability);
        System.out.println(cprobability);
        System.out.println(1 - cprobability);
        will return:
        1.018427824183987E-26
        -2.437485768780334E-10
        1.0000000002437486
        -2.437485768780334E-10
        this shows that in this example, the upper tail cumPro is just 1 - cumPro, which
        is wrong because of the numeric error accumulation we discusse earlier.

        but this is obviously not always happening, as when I tried 6000 200 100 50, it
        returned 6.02E-49, as I expected. what's the mystery here?

        furthermore, if you run the above code but change it so that it runs
        HypergeometricDistributionImpl hDist = new HypergeometricDistributionImpl(26932,
        823, 270);
        the result will be different:
        1.0184278236099406E-26
        3.7159353372118176E-10
        0.99999999962840647
        3.7159353372118176E-10
        but by textbook definition of hypergeometric distribution, the order of
        numSuccess and sample shouldn't matter at all, notice here: A. the raw
        probability is slightly different, at the 9th digit; B. this should have
        generated what the earlier calculation had, but it's totally different.

        so, summarizing the bugs here:
        1. upper tail cumulative probability when running some examples, will overflow
        and give negative values;
        2. upper tail cumPro obviously is not consistently calculating in the same
        fashion, sometimes it works, sometimes not;
        3. the order of numSuccess and sample sometimes matters to the code, while it
        should not, ever;
        4. the raw probability, when change the order of numSuccess and sample, will
        differ slightly, if it's calculated in exactly the same way then it shouldn't.

        I suggest that when you have a fix, test it out further with many more examples,
        like what we suggested here, and in reversing order, etc. and if you could send
        me the jar or post it here I can test it out as well, rather than waiting for
        the nightly build to include it.

        pop numSuccess sample query upperCumPro
        26932 823 270 53 1.4160591836816684E-27
        26932 270 823 53
        6000 200 100 50 6.020909331626761E-49
        6000 100 200 50
        26896 895 55 15 2.077516591801479E-10
        26896 55 895 15

        Show
        mike hu added a comment - OKay, for the love of Java, let's see the details of all the bugs. I tend to believe they're not all caught yet, but if I have the jar I can definitely test them out. for the current build 8-25-05, the code below HypergeometricDistributionImpl hDist = new HypergeometricDistributionImpl(26932, 270, 823); double probability = hDist.probability(52); double utprobability = hDist.upperCumulativeProbability(52); double cprobability = hDist.cumulativeProbability(52); System.out.println(probability); System.out.println(utprobability); System.out.println(cprobability); System.out.println(1 - cprobability); will return: 1.018427824183987E-26 -2.437485768780334E-10 1.0000000002437486 -2.437485768780334E-10 this shows that in this example, the upper tail cumPro is just 1 - cumPro, which is wrong because of the numeric error accumulation we discusse earlier. but this is obviously not always happening, as when I tried 6000 200 100 50, it returned 6.02E-49, as I expected. what's the mystery here? furthermore, if you run the above code but change it so that it runs HypergeometricDistributionImpl hDist = new HypergeometricDistributionImpl(26932, 823, 270); the result will be different: 1.0184278236099406E-26 3.7159353372118176E-10 0.99999999962840647 3.7159353372118176E-10 but by textbook definition of hypergeometric distribution, the order of numSuccess and sample shouldn't matter at all, notice here: A. the raw probability is slightly different, at the 9th digit; B. this should have generated what the earlier calculation had, but it's totally different. so, summarizing the bugs here: 1. upper tail cumulative probability when running some examples, will overflow and give negative values; 2. upper tail cumPro obviously is not consistently calculating in the same fashion, sometimes it works, sometimes not; 3. the order of numSuccess and sample sometimes matters to the code, while it should not, ever; 4. the raw probability, when change the order of numSuccess and sample, will differ slightly, if it's calculated in exactly the same way then it shouldn't. I suggest that when you have a fix, test it out further with many more examples, like what we suggested here, and in reversing order, etc. and if you could send me the jar or post it here I can test it out as well, rather than waiting for the nightly build to include it. pop numSuccess sample query upperCumPro 26932 823 270 53 1.4160591836816684E-27 26932 270 823 53 6000 200 100 50 6.020909331626761E-49 6000 100 200 50 26896 895 55 15 2.077516591801479E-10 26896 55 895 15
        Hide
        Brent Worden added a comment -

        (In reply to comment #15)

        I have not had time to get the fix pushed to the trunk, soit will not be in the
        nightly builds yet. I hope to get around to this today.

        Both problems have been resolved. The problem with returning 0 was caused by a
        simple erroneous boundry value check. The poor accuracy for small cumulative
        probabilities was the result of a truncation error when performing 1.0 -
        verySmallProbability in certain circumstances.

        Of note, the performance for these routines is quite poor for large population
        sizes. I plan to address these limitations and improve performance post
        release 1.1.

        Show
        Brent Worden added a comment - (In reply to comment #15) I have not had time to get the fix pushed to the trunk, soit will not be in the nightly builds yet. I hope to get around to this today. Both problems have been resolved. The problem with returning 0 was caused by a simple erroneous boundry value check. The poor accuracy for small cumulative probabilities was the result of a truncation error when performing 1.0 - verySmallProbability in certain circumstances. Of note, the performance for these routines is quite poor for large population sizes. I plan to address these limitations and improve performance post release 1.1.
        Hide
        pkillion added a comment -

        hello - i show no change in the behavior of the upperCumulativeProbability returning 0.0 when x =
        numsuccess or x = sample size. i am not sure about the other issue (raised by mike hu on 0825).
        perhaps the fix did not make the 0826 build or it is still not fixed.

        Show
        pkillion added a comment - hello - i show no change in the behavior of the upperCumulativeProbability returning 0.0 when x = numsuccess or x = sample size. i am not sure about the other issue (raised by mike hu on 0825). perhaps the fix did not make the 0826 build or it is still not fixed.
        Hide
        pkillion added a comment -

        will this fix be in the 0826 nightly? does it address both issues raised by mike hu?

        Show
        pkillion added a comment - will this fix be in the 0826 nightly? does it address both issues raised by mike hu?
        Hide
        Brent Worden added a comment -

        Fixed. SVN revision 240613. Fixed in MATH_1_1 branch.

        Show
        Brent Worden added a comment - Fixed. SVN revision 240613. Fixed in MATH_1_1 branch.
        Hide
        mike hu added a comment -

        oh, another thing I forgot to mention just now: if you calculate something like
        1000 100 1 1, the upper cumPro will return 0. basically, if your query number
        equals to either the number of success or the sample, then you got 0. for these
        cases, the cumPro should just be the probability of this query number.

        Show
        mike hu added a comment - oh, another thing I forgot to mention just now: if you calculate something like 1000 100 1 1, the upper cumPro will return 0. basically, if your query number equals to either the number of success or the sample, then you got 0. for these cases, the cumPro should just be the probability of this query number.
        Hide
        mike hu added a comment -

        I really hate to post this but I think this bug is somewhat hard to catch:
        evolution somehow kicks in......

        check this out:
        before I used the new jar I coded a loop to calculate all the probabilities of
        the upper tail and sum them to get the upper cumPro, just like textbook
        definition, but of course, slow.
        I used the new upperCumulativeProbability method to calculate my example, 5000
        200 100 50. used my old looping way calculation and a perl script to cross
        check. I got exactly the same answer: upper cumPro = 4.407167010070517E-45. Great!!!
        but, as I used it more, ugly bugs showed up: when I calculate 26896 895 55 19,
        the new jar with the fix gave me 3.40373618E-10, my old looping way and the perl
        script says 6.2588249E-15 is right, and the probability(19)= 5.8805728E-15,
        which is a good comfirmation that the second answer is right, because you can
        imagine something with a tail of even smaller than E-15 won't accumulate to E-10.

        I have absolutely no idea where this comes from now: the fixed jar can calculate
        something as small as 4E-45, as we saw in the first example, which indicates
        accumulation of numeric error is not happening here; without any knowledge of
        the source code (not that I can't get it, but I won't be able to understand it
        at all, as a newbie), I have no idea why it would calculate the second case
        wrong. you guys might think "hey, these are values small enough to be taken as
        zero, what's the difference?!" I guess mathematicall precision is what we're
        after here. my looping code can do the job, but it's gonna be tooo slow imagine
        you do this over and over.

        Thanks for everyone's patience and effort in solving this.

        Show
        mike hu added a comment - I really hate to post this but I think this bug is somewhat hard to catch: evolution somehow kicks in...... check this out: before I used the new jar I coded a loop to calculate all the probabilities of the upper tail and sum them to get the upper cumPro, just like textbook definition, but of course, slow. I used the new upperCumulativeProbability method to calculate my example, 5000 200 100 50. used my old looping way calculation and a perl script to cross check. I got exactly the same answer: upper cumPro = 4.407167010070517E-45. Great!!! but, as I used it more, ugly bugs showed up: when I calculate 26896 895 55 19, the new jar with the fix gave me 3.40373618E-10, my old looping way and the perl script says 6.2588249E-15 is right, and the probability(19)= 5.8805728E-15, which is a good comfirmation that the second answer is right, because you can imagine something with a tail of even smaller than E-15 won't accumulate to E-10. I have absolutely no idea where this comes from now: the fixed jar can calculate something as small as 4E-45, as we saw in the first example, which indicates accumulation of numeric error is not happening here; without any knowledge of the source code (not that I can't get it, but I won't be able to understand it at all, as a newbie), I have no idea why it would calculate the second case wrong. you guys might think "hey, these are values small enough to be taken as zero, what's the difference?!" I guess mathematicall precision is what we're after here. my looping code can do the job, but it's gonna be tooo slow imagine you do this over and over. Thanks for everyone's patience and effort in solving this.
        Hide
        mike hu added a comment -

        Thanks a lot! you guys rock

        Show
        mike hu added a comment - Thanks a lot! you guys rock
        Hide
        Brent Worden added a comment -

        I just merged the changes, including this fix, in the MATH_1_1 branch to the
        trunk. The next nightly build should contain your fix.

        The actual added method is only found on the HypergeometricDistributionImpl and
        has a signature of:
        public double upperCumulativeProbability(int x);

        It returns P(X >= x).

        Show
        Brent Worden added a comment - I just merged the changes, including this fix, in the MATH_1_1 branch to the trunk. The next nightly build should contain your fix. The actual added method is only found on the HypergeometricDistributionImpl and has a signature of: public double upperCumulativeProbability(int x); It returns P(X >= x).
        Hide
        mike hu added a comment -

        Oh, I forgot to ask, can you also let me know something about the new upper tail
        method that Brent mentioned? Thanks!

        Show
        mike hu added a comment - Oh, I forgot to ask, can you also let me know something about the new upper tail method that Brent mentioned? Thanks!
        Hide
        mike hu added a comment -

        Ah, I see. so could you please just send me a jar to test with? I don't know how
        to do it the other way, after all, I'm a newbie to the java world.
        microarray@gmail.com
        Thanks!

        (In reply to comment #6)

        Show
        mike hu added a comment - Ah, I see. so could you please just send me a jar to test with? I don't know how to do it the other way, after all, I'm a newbie to the java world. microarray@gmail.com Thanks! (In reply to comment #6)
        Hide
        Phil Steitz added a comment -

        The fix was committed to the MATH_1_1 release branch, not to the svn trunk, so
        the nightlies have not picked it up yet. Once the 1.1 release goes final
        (hopefully very soon), we will merge all of the fixes recently made (including
        this one) back into the main development trunk.

        For now, to test your code, you will have to check out and build using the
        MATH_1_1 branch. If you need help doing this, or would just like a jar to test
        with, let me know.

        Show
        Phil Steitz added a comment - The fix was committed to the MATH_1_1 release branch, not to the svn trunk, so the nightlies have not picked it up yet. Once the 1.1 release goes final (hopefully very soon), we will merge all of the fixes recently made (including this one) back into the main development trunk. For now, to test your code, you will have to check out and build using the MATH_1_1 branch. If you need help doing this, or would just like a jar to test with, let me know.
        Hide
        mike hu added a comment -

        Thanks for the fix. However I'm still having problems with this. When I compiled
        my code again with the new build (I tried the nightly build of 8-20 and 8-21)
        and try to calculate the cumulativeProbability of 5000 200 100 50, I still got
        the same overflow 1.0000000000134985.
        Initially I guess it's because Brent Worden has added a new method to calculate
        the upper tail and the current method is untouched? But I can't find anywhere
        about how to use the new method.
        This post may sound stupid; can anyone tell me whether I'm doing the right thing
        here and the bug is really still here?

        (In reply to comment #4)
        > Addressed. SVN revision 233412. Committed to MATH_1_1 branch. Added upper
        > tail cumulative probability method to HypergeometricDistributionImpl. Note,
        > this new method was not added to the HypergeometricDistribution interface to
        > avoid binary incompatiblities with older versions.

        Show
        mike hu added a comment - Thanks for the fix. However I'm still having problems with this. When I compiled my code again with the new build (I tried the nightly build of 8-20 and 8-21) and try to calculate the cumulativeProbability of 5000 200 100 50, I still got the same overflow 1.0000000000134985. Initially I guess it's because Brent Worden has added a new method to calculate the upper tail and the current method is untouched? But I can't find anywhere about how to use the new method. This post may sound stupid; can anyone tell me whether I'm doing the right thing here and the bug is really still here? (In reply to comment #4) > Addressed. SVN revision 233412. Committed to MATH_1_1 branch. Added upper > tail cumulative probability method to HypergeometricDistributionImpl. Note, > this new method was not added to the HypergeometricDistribution interface to > avoid binary incompatiblities with older versions.
        Hide
        Brent Worden added a comment -

        Addressed. SVN revision 233412. Committed to MATH_1_1 branch. Added upper
        tail cumulative probability method to HypergeometricDistributionImpl. Note,
        this new method was not added to the HypergeometricDistribution interface to
        avoid binary incompatiblities with older versions.

        Show
        Brent Worden added a comment - Addressed. SVN revision 233412. Committed to MATH_1_1 branch. Added upper tail cumulative probability method to HypergeometricDistributionImpl. Note, this new method was not added to the HypergeometricDistribution interface to avoid binary incompatiblities with older versions.
        Hide
        mike hu added a comment -

        I think you're right. it is coming from the accumulation of numerical error. but
        I think the solution should be more complicated than a simple short circuit.
        let's see an example again:

        if we calculate cumulativeProbability(50) for (6000, 200, 100), the upper tail
        (X>=50) probability will be 1 - cumulativeProbability(50) = 3.37E-11. but if you
        do a sum of the point probabilities using probability(50),(51),...,(100), which
        by textbook definition defines the upper tail, you'll get 6.02E-49.
        the discrepency, coming from the accumulation of numerical error, results in a
        false estimate of the real cumulative probability, by a factor of E38. true,
        both are very small, but in some cases it would matter for sure.

        that's why my suggestion is, either we define a new function to the class to
        calculate P(X>=x) for upper tail probability, or we can define the
        cumulativeProbability so that it calculates P(m<=x<=n); in this case if you
        simply want upper or lower tail, just define m or n separately. and I guess
        it'll help to short-circuit the class so that it won't produce p>1 or p<0 cases.

        what do you think?

        (In reply to comment #2)

        Show
        mike hu added a comment - I think you're right. it is coming from the accumulation of numerical error. but I think the solution should be more complicated than a simple short circuit. let's see an example again: if we calculate cumulativeProbability(50) for (6000, 200, 100), the upper tail (X>=50) probability will be 1 - cumulativeProbability(50) = 3.37E-11. but if you do a sum of the point probabilities using probability(50),(51),...,(100), which by textbook definition defines the upper tail, you'll get 6.02E-49. the discrepency, coming from the accumulation of numerical error, results in a false estimate of the real cumulative probability, by a factor of E38. true, both are very small, but in some cases it would matter for sure. that's why my suggestion is, either we define a new function to the class to calculate P(X>=x) for upper tail probability, or we can define the cumulativeProbability so that it calculates P(m<=x<=n); in this case if you simply want upper or lower tail, just define m or n separately. and I guess it'll help to short-circuit the class so that it won't produce p>1 or p<0 cases. what do you think? (In reply to comment #2)
        Hide
        Brent Worden added a comment -

        I think we're witnessing the accumulation of numerical error. The cumulative
        probability method uses the textbook definition for its computation: summing
        the point probabilities. Each of these probabilities results in a small
        numerical error and the process of summing them just exacerbates the error.

        In your case, X ~ hyper(2000, 500, 100) and P(X <= 50) there are 51 small
        errors that, when added together, cause the total sum to be over 1.0. Also,
        the point probabilities the larger x values (30 - 50) are so small that they
        provide little value to the overall sum, just error.

        I think we can address this problem by simply short-circuiting the summation
        whenever it goes above 1.0. This will solve the problem of cumulative
        probablities greater than 1 and make the upper tail probabilties correct
        because, again in your case, P(X >= 50) is extremely close to zero.

        What do others think?

        Show
        Brent Worden added a comment - I think we're witnessing the accumulation of numerical error. The cumulative probability method uses the textbook definition for its computation: summing the point probabilities. Each of these probabilities results in a small numerical error and the process of summing them just exacerbates the error. In your case, X ~ hyper(2000, 500, 100) and P(X <= 50) there are 51 small errors that, when added together, cause the total sum to be over 1.0. Also, the point probabilities the larger x values (30 - 50) are so small that they provide little value to the overall sum, just error. I think we can address this problem by simply short-circuiting the summation whenever it goes above 1.0. This will solve the problem of cumulative probablities greater than 1 and make the upper tail probabilties correct because, again in your case, P(X >= 50) is extremely close to zero. What do others think?
        Hide
        mike hu added a comment -

        Created an attachment (id=16073)
        the code as a txt file

        the same code as in my post. figured this might be easier to work with.

        Show
        mike hu added a comment - Created an attachment (id=16073) the code as a txt file the same code as in my post. figured this might be easier to work with.

          People

          • Assignee:
            Unassigned
            Reporter:
            mike hu
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development