Commons Math
  1. Commons Math
  2. MATH-418

add a storeless version of Percentile

    Details

    • Type: New Feature New Feature
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1
    • Fix Version/s: 4.0
    • Labels:
      None

      Description

      The Percentile class can handle only in-memory data.
      It would be interesting to use an on-line algorithm to estimate quantiles as a storeless statistic.
      An example of such an algorithm is the exponentially weighted stochastic approximation described in a 2000 paper by Fei Chen , Diane Lambert and José C. Pinheiro "Incremental Quantile Estimation for Massive Tracking" which can be retrieved from CiteSeerX at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580.

      1. patch
        72 kB
        Venkatesha Murthy TS

        Issue Links

          Activity

          Hide
          Phil Steitz added a comment -

          I am interested in working on this; but for greater good will wait for 3.1

          Show
          Phil Steitz added a comment - I am interested in working on this; but for greater good will wait for 3.1
          Hide
          Luc Maisonobe added a comment -

          Postponing to 4.0.

          Show
          Luc Maisonobe added a comment - Postponing to 4.0.
          Hide
          Phil Steitz added a comment -

          Another algorithm to consider is P-Square http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf

          Also, as a special case, I have a Remdedian implementation, following http://web.ipac.caltech.edu/staff/fmasci/home/statistics_refs/Remedian.pdf that I was going to propose for inclusion in [math]. I still have some work to do on the tests, but I could do that fairly quickly if others are OK adding it. I understand if consensus is to just implement a general solution.

          Show
          Phil Steitz added a comment - Another algorithm to consider is P-Square http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf Also, as a special case, I have a Remdedian implementation, following http://web.ipac.caltech.edu/staff/fmasci/home/statistics_refs/Remedian.pdf that I was going to propose for inclusion in [math] . I still have some work to do on the tests, but I could do that fairly quickly if others are OK adding it. I understand if consensus is to just implement a general solution.
          Hide
          Ajo Fod added a comment -

          Any solution is better than the current situation. However, there are two desirable features.

          Reducability: Since large data datasets typically are associated with parallel environments, the ideal algorithm would be map/reduce-able.

          Heteroskedasticity: What if one passes it the quantiles of say an exponential distribution not randomly but in sequential order to the quantile estimator? How big would the quantile error be? Is it significantly more than the iid case?

          Show
          Ajo Fod added a comment - Any solution is better than the current situation. However, there are two desirable features. Reducability: Since large data datasets typically are associated with parallel environments, the ideal algorithm would be map/reduce-able. Heteroskedasticity: What if one passes it the quantiles of say an exponential distribution not randomly but in sequential order to the quantile estimator? How big would the quantile error be? Is it significantly more than the iid case?
          Hide
          Ajo Fod added a comment -

          A sister project under Apache has an interesting implementation. http://linkedin.github.io/datafu/docs/javadoc/datafu/pig/stats/StreamingQuantile.html

          Show
          Ajo Fod added a comment - A sister project under Apache has an interesting implementation. http://linkedin.github.io/datafu/docs/javadoc/datafu/pig/stats/StreamingQuantile.html
          Hide
          Ajo Fod added a comment -
          Show
          Ajo Fod added a comment - Just read the original paper here: http://polylogblog.files.wordpress.com/2009/08/80munro-median.pdf
          Hide
          Ajo Fod added a comment -

          Just in case anyone is interested: The colt implementation for the class here:
          http://acs.lbl.gov/software/colt/api/hep/aida/bin/QuantileBin1D.html

          is based on the paper by Manku: http://www.cs.umd.edu/~samir/498/manku.pdf

          Show
          Ajo Fod added a comment - Just in case anyone is interested: The colt implementation for the class here: http://acs.lbl.gov/software/colt/api/hep/aida/bin/QuantileBin1D.html is based on the paper by Manku: http://www.cs.umd.edu/~samir/498/manku.pdf
          Hide
          Ted Dunning added a comment -

          The t-digest is also available.

          You can read more at

          https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf?raw=true

          This paper contains accuracy comparisons against the DataFu StreamingQuantile implementation of GK01 and against the Q-digest algorithm. T-digest dominates both in terms of space and accuracy. I haven't tested for speed yet since the t-digest implementation is improving rapidly. T-digest is also a very simple algorithm conceptually which helps with accurate implementation. T-digest is particularly good at maintaining accuracy for extreme quantiles and for highly skewed distribution. GK01 does well on skewed distributions, but maintains constant absolute accuracy rather than relative accuracy. Q-digest depends on integer representations and thus fails badly on sufficiently skewed distributions.

          The library involved is available via maven and is apache licensed. Apache Commons Math has a "no dependency" policy which might mean that sucking in the code would be a better option than simply linking to this.

          The standard of the art before t-digest is generally considered to be either Greenwald and Khanna's algorithm GK01 (what DataFu uses) or the Q-digest (what stream-lib used before they adopted t-digest). References are in the paper above.

          In case it isn't obvious, source code is available on github at

          https://github.com/tdunning/t-digest

          The p^2 algorithm is actually quite old and is far from the state of the art.

          Show
          Ted Dunning added a comment - The t-digest is also available. You can read more at https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf?raw=true This paper contains accuracy comparisons against the DataFu StreamingQuantile implementation of GK01 and against the Q-digest algorithm. T-digest dominates both in terms of space and accuracy. I haven't tested for speed yet since the t-digest implementation is improving rapidly. T-digest is also a very simple algorithm conceptually which helps with accurate implementation. T-digest is particularly good at maintaining accuracy for extreme quantiles and for highly skewed distribution. GK01 does well on skewed distributions, but maintains constant absolute accuracy rather than relative accuracy. Q-digest depends on integer representations and thus fails badly on sufficiently skewed distributions. The library involved is available via maven and is apache licensed. Apache Commons Math has a "no dependency" policy which might mean that sucking in the code would be a better option than simply linking to this. The standard of the art before t-digest is generally considered to be either Greenwald and Khanna's algorithm GK01 (what DataFu uses) or the Q-digest (what stream-lib used before they adopted t-digest). References are in the paper above. In case it isn't obvious, source code is available on github at https://github.com/tdunning/t-digest The p^2 algorithm is actually quite old and is far from the state of the art.
          Hide
          Venkatesha Murthy TS added a comment -

          Here is the issue i created for p^2 (https://issues.apache.org/jira/browse/MATH-1112 ) .
          Hi All
          thanks for mentioning this old issue. By the way would it help having a common strategy interface to accommodate these various algorithms.
          I feel either ways we should soon have this ability.
          Please let know if i could still submit my implementation for review

          thanks
          vmurthy

          Show
          Venkatesha Murthy TS added a comment - Here is the issue i created for p^2 ( https://issues.apache.org/jira/browse/MATH-1112 ) . Hi All thanks for mentioning this old issue. By the way would it help having a common strategy interface to accommodate these various algorithms. I feel either ways we should soon have this ability. Please let know if i could still submit my implementation for review thanks vmurthy
          Hide
          Ted Dunning added a comment -

          If you already have an implementation, feel free to post it.

          Best, however, would be to have a stand-alone implementation that does accuracy and speed comparison in the style of the tests in the t-digest code.

          Perhaps the shortest path for you would be to fork t-digest, put your implementation into those tests and we can look at accuracy and speed.

          Show
          Ted Dunning added a comment - If you already have an implementation, feel free to post it. Best, however, would be to have a stand-alone implementation that does accuracy and speed comparison in the style of the tests in the t-digest code. Perhaps the shortest path for you would be to fork t-digest, put your implementation into those tests and we can look at accuracy and speed.
          Hide
          Venkatesha Murthy TS added a comment -

          placed a first snapshot into sonatype and have added tests similar to what is available for Percentile.
          Please guide me for next steps.

          https://issues.sonatype.org/browse/OSSRH-9364

          Show
          Venkatesha Murthy TS added a comment - placed a first snapshot into sonatype and have added tests similar to what is available for Percentile. Please guide me for next steps. https://issues.sonatype.org/browse/OSSRH-9364
          Hide
          Phil Steitz added a comment -

          Can you please submit the changes as a patch against trunk in svn?

          Show
          Phil Steitz added a comment - Can you please submit the changes as a patch against trunk in svn?
          Hide
          Venkatesha Murthy TS added a comment -

          Please ignore files such as build, unversioned_files and file named patch itself in the attached file. the only interested files are java files (one in src/main and another src/test).

          Show
          Venkatesha Murthy TS added a comment - Please ignore files such as build, unversioned_files and file named patch itself in the attached file. the only interested files are java files (one in src/main and another src/test).
          Hide
          Venkatesha Murthy TS added a comment -

          added the patch file

          Show
          Venkatesha Murthy TS added a comment - added the patch file
          Hide
          Venkatesha Murthy TS added a comment -

          Hi,
          I have added the patch file into this ticket as attachment. please let me know

          thanks
          murthy

          Show
          Venkatesha Murthy TS added a comment - Hi, I have added the patch file into this ticket as attachment. please let me know thanks murthy

            People

            • Assignee:
              Unassigned
              Reporter:
              Luc Maisonobe
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:

                Development