Mahout
  1. Mahout
  2. MAHOUT-1361

Online algorithm for computing accurate Quantiles using 1-D clustering

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 0.9
    • Component/s: Math
    • Labels:
      None

      Description

      Implementation of Ted Dunning's paper and initial work on this subject. See https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf for the paper.

      An on-line algorithm for computing approximations of rank-based statistics that allows controllable accuracy. This algorithm can also be used to compute hybrid statistics such as trimmed means in addition to computing arbitrary quantiles.

      1. MAHOUT-1361.patch
        51 kB
        Suneel Marthi

        Issue Links

          Activity

          Hide
          Suneel Marthi added a comment -

          Patch committed to trunk.

          Show
          Suneel Marthi added a comment - Patch committed to trunk.
          Hide
          Hudson added a comment -

          SUCCESS: Integrated in Mahout-Quality #2333 (See https://builds.apache.org/job/Mahout-Quality/2333/)
          MAHOUT-1361: Updated with latest changes from Ted's git repo. (smarthi: rev 1544930)

          • /mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/GroupTree.java
          • /mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/Histo.java
          • /mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/TDigest.java
          • /mahout/trunk/math/src/test/java/org/apache/mahout/math/stats/GroupTreeTest.java
          • /mahout/trunk/math/src/test/java/org/apache/mahout/math/stats/HistoTest.java
          • /mahout/trunk/math/src/test/java/org/apache/mahout/math/stats/TDigestTest.java
          Show
          Hudson added a comment - SUCCESS: Integrated in Mahout-Quality #2333 (See https://builds.apache.org/job/Mahout-Quality/2333/ ) MAHOUT-1361 : Updated with latest changes from Ted's git repo. (smarthi: rev 1544930) /mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/GroupTree.java /mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/Histo.java /mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/TDigest.java /mahout/trunk/math/src/test/java/org/apache/mahout/math/stats/GroupTreeTest.java /mahout/trunk/math/src/test/java/org/apache/mahout/math/stats/HistoTest.java /mahout/trunk/math/src/test/java/org/apache/mahout/math/stats/TDigestTest.java
          Hide
          Suneel Marthi added a comment -

          Changes from Ted's github repo committed to Mahout trunk.

          Show
          Suneel Marthi added a comment - Changes from Ted's github repo committed to Mahout trunk.
          Hide
          Hudson added a comment -

          SUCCESS: Integrated in Mahout-Quality #2332 (See https://builds.apache.org/job/Mahout-Quality/2332/)
          MAHOUT-1361: Online algorithm for computing accurate Quantiles using 1-D clustering (smarthi: rev 1544927)

          • /mahout/trunk/CHANGELOG
          • /mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/GroupTree.java
          • /mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/Histo.java
          • /mahout/trunk/math/src/test/java/org/apache/mahout/math/stats/GroupTreeTest.java
          • /mahout/trunk/math/src/test/java/org/apache/mahout/math/stats/HistoTest.java
          Show
          Hudson added a comment - SUCCESS: Integrated in Mahout-Quality #2332 (See https://builds.apache.org/job/Mahout-Quality/2332/ ) MAHOUT-1361 : Online algorithm for computing accurate Quantiles using 1-D clustering (smarthi: rev 1544927) /mahout/trunk/CHANGELOG /mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/GroupTree.java /mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/Histo.java /mahout/trunk/math/src/test/java/org/apache/mahout/math/stats/GroupTreeTest.java /mahout/trunk/math/src/test/java/org/apache/mahout/math/stats/HistoTest.java
          Hide
          Ted Dunning added a comment -

          Otis,

          I added merging and tests on the github version.

          Haven't added it to the paper.

          Show
          Ted Dunning added a comment - Otis, I added merging and tests on the github version. Haven't added it to the paper.
          Hide
          Ted Dunning added a comment -

          it's my understanding current code works on double values (integers).

          It works on doubles, not integers (casting works, of course.

          Do you think it is possible to adapt it to a lexicographical order of unlimited values?

          Yes.

          Except that you need to define what a mean of a bunch of lexicographical values might be. I need an online updatable quantity that works for that. It might be as simple as 11 random values and reservoir sampling, but I don't see a compelling use case for that yet.

          Show
          Ted Dunning added a comment - it's my understanding current code works on double values (integers). It works on doubles, not integers (casting works, of course. Do you think it is possible to adapt it to a lexicographical order of unlimited values? Yes. Except that you need to define what a mean of a bunch of lexicographical values might be. I need an online updatable quantity that works for that. It might be as simple as 11 random values and reservoir sampling, but I don't see a compelling use case for that yet.
          Hide
          Dmitriy Lyubimov added a comment -

          Ted,

          it's my understanding current code works on double values (integers).

          Do you think it is possible to adapt it to a lexicographical order of unlimited values? Thank you.

          Show
          Dmitriy Lyubimov added a comment - Ted, it's my understanding current code works on double values (integers). Do you think it is possible to adapt it to a lexicographical order of unlimited values? Thank you.
          Hide
          Ted Dunning added a comment -

          does Q-digest to anything that t-digest just cannot do?

          No. Not that I know of.

          is stream-lib the right home for t-digest? (not sure if Mahout depends on it already/yet)

          I think that steam-lib is a good place for it.

          Show
          Ted Dunning added a comment - does Q-digest to anything that t-digest just cannot do? No. Not that I know of. is stream-lib the right home for t-digest? (not sure if Mahout depends on it already/yet) I think that steam-lib is a good place for it.
          Hide
          Otis Gospodnetic added a comment -

          Ted - you lost me at Otis
          But if you'll be adding that to your paper, I'll make sure my colleagues and I read and.... t-digest it.

          2 more questions:

          1. does Q-digest to anything that t-digest just cannot do?
          2. is stream-lib the right home for t-digest? (not sure if Mahout depends on it already/yet)
          Show
          Otis Gospodnetic added a comment - Ted - you lost me at Otis But if you'll be adding that to your paper, I'll make sure my colleagues and I read and.... t-digest it. 2 more questions: does Q-digest to anything that t-digest just cannot do? is stream-lib the right home for t-digest? (not sure if Mahout depends on it already/yet)
          Hide
          Ted Dunning added a comment -

          t-digest - love the name, Ted Dunning

          The name was an act of desperation. In presenting the results informally to several friends, the term 1-d clustering just wasn't working at all. The analogy with Q-digests was quite attractive and in a fit of weakness, I picked t to be the letter to replace Q.

          Show
          Ted Dunning added a comment - t-digest - love the name, Ted Dunning The name was an act of desperation. In presenting the results informally to several friends, the term 1-d clustering just wasn't working at all. The analogy with Q-digests was quite attractive and in a fit of weakness, I picked t to be the letter to replace Q.
          Hide
          Ted Dunning added a comment -

          Otis,

          I should put merging into the paper. Omitting it was an oversight.

          Yes. It is quite possible. The basic idea is that you concatenate the sub-digests and compact the result exactly as you do in the sequential points case where the number of centroids grows too large. The compaction is done by taking centroids in random order and adding them to a new digest. The slightly sophistication is that the centroid being added can be added in pieces to several other centroids if they are the closest ones. This cleverness will probably have nearly negligible effect, but it will guarantee the invariant more precisely than if you create a new centroid from the centroid if it has too much weight to fit on any of the equally near existing centroids.

          Show
          Ted Dunning added a comment - Otis, I should put merging into the paper. Omitting it was an oversight. Yes. It is quite possible. The basic idea is that you concatenate the sub-digests and compact the result exactly as you do in the sequential points case where the number of centroids grows too large. The compaction is done by taking centroids in random order and adding them to a new digest. The slightly sophistication is that the centroid being added can be added in pieces to several other centroids if they are the closest ones. This cleverness will probably have nearly negligible effect, but it will guarantee the invariant more precisely than if you create a new centroid from the centroid if it has too much weight to fit on any of the equally near existing centroids.
          Hide
          Ted Dunning added a comment -

          DL,

          I can't comment specifically (which is part of why I implemented this). My impression is that Q-digests dominate count min sketches in terms of memory / accuracy trade-off, but I am not at all sure. I am absolutely certain that t-digests are better than Q-digests in terms of memory / accuracy for extreme quantiles since the errors will be measured in a few ppm instead of a few percent. If you used a flat centroid size limit to compare against Q-digests, the memory consumption should be quite similar. In any case, t-digests are much simpler conceptually and do not have the difficulty that Q-digests have that the proof of accuracy is based on the batch mode rather than the on-line case.

          Show
          Ted Dunning added a comment - DL, I can't comment specifically (which is part of why I implemented this). My impression is that Q-digests dominate count min sketches in terms of memory / accuracy trade-off, but I am not at all sure. I am absolutely certain that t-digests are better than Q-digests in terms of memory / accuracy for extreme quantiles since the errors will be measured in a few ppm instead of a few percent. If you used a flat centroid size limit to compare against Q-digests, the memory consumption should be quite similar. In any case, t-digests are much simpler conceptually and do not have the difficulty that Q-digests have that the proof of accuracy is based on the batch mode rather than the on-line case.
          Hide
          Otis Gospodnetic added a comment -

          t-digest - love the name, Ted Dunning
          I skimmed the paper, but didn't see any mentions about being able to merge multiple computations. I think that's one of the key properties of QDigest. Is that doable with t-digest?

          Show
          Otis Gospodnetic added a comment - t-digest - love the name, Ted Dunning I skimmed the paper, but didn't see any mentions about being able to merge multiple computations. I think that's one of the key properties of QDigest. Is that doable with t-digest?
          Hide
          Dmitriy Lyubimov added a comment -

          interesting.

          i've been using minCountSketch quantiles with a fairly ok results . How does it compare in effort/precision to minCountSketch and similar sketchlike stuff?

          Show
          Dmitriy Lyubimov added a comment - interesting. i've been using minCountSketch quantiles with a fairly ok results . How does it compare in effort/precision to minCountSketch and similar sketchlike stuff?
          Hide
          Suneel Marthi added a comment -

          Patch created from Ted's github repo at https://github.com/tdunning/t-digest

          Show
          Suneel Marthi added a comment - Patch created from Ted's github repo at https://github.com/tdunning/t-digest

            People

            • Assignee:
              Suneel Marthi
              Reporter:
              Suneel Marthi
            • Votes:
              1 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development