Commons Math / MATH-418

add a storeless version of Percentile

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: 3.4
    • Labels: None

      Description

      The Percentile class can handle only in-memory data.
      It would be interesting to use an on-line algorithm to estimate quantiles as a storeless statistic.
      An example of such an algorithm is the exponentially weighted stochastic approximation described in a 2000 paper by Fei Chen, Diane Lambert and José C. Pinheiro, "Incremental Quantile Estimation for Massive Tracking", which can be retrieved from CiteSeerX at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580.
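
      For illustration, a minimal sketch of a storeless quantile estimate driven by stochastic approximation. This is not the exact Chen/Lambert/Pinheiro algorithm (their scheme uses an exponentially weighted step size and a density estimate); the 1/n gain here is a simplification, and all names are hypothetical.

        /** Minimal sketch: one quantile tracked online, O(1) memory. */
        public class OnlineQuantileSketch {
            private final double p;   // target quantile in [0, 1]
            private double estimate;  // current quantile estimate
            private long n;           // number of observations seen

            public OnlineQuantileSketch(double p) {
                this.p = p;
            }

            public void increment(double x) {
                n++;
                if (n == 1) {
                    estimate = x;           // initialize with the first observation
                    return;
                }
                double step = 1.0 / n;      // decreasing gain; CLP uses an exponentially weighted variant
                if (x < estimate) {
                    estimate -= step * (1.0 - p);
                } else {
                    estimate += step * p;
                }
            }

            public double getResult() {
                return estimate;
            }
        }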

      Attachments

      1. 30-may-2014-418-psquare-patch (66 kB) - Venkatesha Murthy TS
      2. 418-psquare-patch (96 kB) - Venkatesha Murthy TS
      3. psquare-23-june.patch (2 kB) - Venkatesha Murthy TS
      4. psquare-patch (99 kB) - Venkatesha Murthy TS


          Activity

          Phil Steitz added a comment -

          I am interested in working on this, but for the greater good I will wait for 3.1.

          Luc Maisonobe added a comment -

          Postponing to 4.0.

          Phil Steitz added a comment -

          Another algorithm to consider is P-Square http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf

          Also, as a special case, I have a Remedian implementation, following http://web.ipac.caltech.edu/staff/fmasci/home/statistics_refs/Remedian.pdf, that I was going to propose for inclusion in [math]. I still have some work to do on the tests, but I could do that fairly quickly if others are OK adding it. I understand if consensus is to just implement a general solution.
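
          For readers unfamiliar with the Remedian, a rough sketch of the idea from the linked paper (buffers of size b; when one fills, its median is pushed one level up). This is only an illustration, not the implementation mentioned above, and the final estimate shown here is a crude simplification.

            import java.util.Arrays;

            /** Rough Remedian sketch: k buffers of size b, O(b*k) memory. */
            public class RemedianSketch {
                private final int b;           // buffer size (odd keeps medians exact)
                private final double[][] buf;  // buf[level][slot]
                private final int[] count;     // current fill count per level

                public RemedianSketch(int b, int levels) {
                    this.b = b;
                    this.buf = new double[levels][b];
                    this.count = new int[levels];
                }

                public void increment(double x) {
                    insert(0, x);
                }

                private void insert(int level, double x) {
                    buf[level][count[level]++] = x;
                    if (count[level] == b) {                      // buffer full
                        double[] copy = Arrays.copyOf(buf[level], b);
                        Arrays.sort(copy);
                        double median = copy[b / 2];
                        count[level] = 0;
                        if (level + 1 < buf.length) {
                            insert(level + 1, median);            // push the buffer's median one level up
                        } else {
                            buf[level][count[level]++] = median;  // top level: collapse to its median
                        }
                    }
                }

                /** Crude estimate: the median of the highest non-empty buffer. */
                public double getResult() {
                    for (int level = buf.length - 1; level >= 0; level--) {
                        if (count[level] > 0) {
                            double[] copy = Arrays.copyOf(buf[level], count[level]);
                            Arrays.sort(copy);
                            return copy[copy.length / 2];
                        }
                    }
                    return Double.NaN;
                }
            }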

          Ajo Fod added a comment -

          Any solution is better than the current situation. However, there are two desirable features.

          Reducibility: Since large datasets are typically associated with parallel environments, the ideal algorithm would be map/reduce-able (see the sketch after this comment).

          Heteroskedasticity: What if one passes the quantile estimator the quantiles of, say, an exponential distribution, not in random but in sequential order? How big would the quantile error be? Is it significantly larger than in the iid case?
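
          For concreteness, "map/reduce-able" mostly means the sketch must be mergeable. Something like the following interface would capture that; the names are purely hypothetical and no such interface exists in [math].

            /** Hypothetical shape of a mergeable (map/reduce friendly) quantile sketch. */
            public interface MergeableQuantileSketch {

                /** Incorporate a single observation (the "map" side). */
                void increment(double x);

                /** Fold another partial sketch into this one (the "reduce" side). */
                void merge(MergeableQuantileSketch other);

                /** Estimate the value at quantile p, with p in [0, 1]. */
                double quantile(double p);
            }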

          Ajo Fod added a comment -

          A sister project under Apache has an interesting implementation. http://linkedin.github.io/datafu/docs/javadoc/datafu/pig/stats/StreamingQuantile.html

          Ajo Fod added a comment -

          Just read the original paper here: http://polylogblog.files.wordpress.com/2009/08/80munro-median.pdf

          Ajo Fod added a comment -

          Just in case anyone is interested: the Colt implementation of this, the class here:
          http://acs.lbl.gov/software/colt/api/hep/aida/bin/QuantileBin1D.html

          is based on the paper by Manku: http://www.cs.umd.edu/~samir/498/manku.pdf

          Ted Dunning added a comment -

          The t-digest is also available.

          You can read more at

          https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf?raw=true

          This paper contains accuracy comparisons against the DataFu StreamingQuantile implementation of GK01 and against the Q-digest algorithm. T-digest dominates both in terms of space and accuracy. I haven't tested for speed yet since the t-digest implementation is improving rapidly. T-digest is also a very simple algorithm conceptually, which helps with accurate implementation. T-digest is particularly good at maintaining accuracy for extreme quantiles and for highly skewed distributions. GK01 does well on skewed distributions, but maintains constant absolute accuracy rather than relative accuracy. Q-digest depends on integer representations and thus fails badly on sufficiently skewed distributions.

          The library involved is available via Maven and is Apache-licensed. Apache Commons Math has a "no dependency" policy, which might mean that sucking in the code would be a better option than simply linking to this.

          The state of the art before t-digest is generally considered to be either Greenwald and Khanna's algorithm GK01 (what DataFu uses) or the Q-digest (what stream-lib used before they adopted t-digest). References are in the paper above.

          In case it isn't obvious, source code is available on github at

          https://github.com/tdunning/t-digest

          The p^2 algorithm is actually quite old and is far from the state of the art.
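
          For anyone who wants to try it, usage looks roughly like this. This is a sketch against the com.tdunning.math.stats artifact; the factory method names have varied between releases, so treat them as an assumption.

            import com.tdunning.math.stats.TDigest;

            public class TDigestExample {
                public static void main(String[] args) {
                    // compression ~100 trades memory for accuracy at the tails
                    TDigest digest = TDigest.createMergingDigest(100);

                    java.util.Random rng = new java.util.Random(42);
                    for (int i = 0; i < 1_000_000; i++) {
                        digest.add(rng.nextDouble());
                    }

                    System.out.println(digest.quantile(0.5));      // ~0.5
                    System.out.println(digest.quantile(0.99999));  // extreme quantile
                    System.out.println(digest.cdf(0.25));          // inverse direction: ~0.25
                }
            }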

          Venkatesha Murthy TS added a comment -

          Here is the issue I created for p^2 (https://issues.apache.org/jira/browse/MATH-1112).
          Hi all,
          Thanks for mentioning this old issue. By the way, would it help to have a common strategy interface to accommodate these various algorithms?
          I feel that either way we should soon have this ability.
          Please let me know if I could still submit my implementation for review.

          thanks
          vmurthy

          Ted Dunning added a comment -

          If you already have an implementation, feel free to post it.

          Best, however, would be to have a stand-alone implementation that does accuracy and speed comparison in the style of the tests in the t-digest code.

          Perhaps the shortest path for you would be to fork t-digest, put your implementation into those tests and we can look at accuracy and speed.

          Venkatesha Murthy TS added a comment -

          Placed a first snapshot into Sonatype and added tests similar to what is available for Percentile.
          Please guide me on the next steps.

          https://issues.sonatype.org/browse/OSSRH-9364

          Phil Steitz added a comment -

          Can you please submit the changes as a patch against trunk in svn?

          Venkatesha Murthy TS added a comment -

          Please ignore files such as build, unversioned_files, and the file named patch itself in the attached file; the only files of interest are the Java files (one in src/main and another in src/test).

          Venkatesha Murthy TS added a comment -

          Added the patch file.

          Venkatesha Murthy TS added a comment -

          Hi,
          I have added the patch file to this ticket as an attachment. Please let me know.

          thanks
          murthy

          Phil Steitz added a comment -

          Thanks for the patch, Murthy!

          Sorry for the delay reviewing. I am not finished reviewing the implementation, but need to confirm a couple of things before plowing ahead with it:

          0) You are OK removing the @author tag. We don't use those in Apache Commons.
          1) The code in the patch is a "clean room" implementation based on the articles, not a port / translation from another source
          2) You can contribute this code under the terms of the Apache Contributor's License Agreement (https://www.apache.org/licenses/icla.txt)

          CheckStyle and Findbugs complain about quite a few things in the source. These will need to be fixed or explained prior to commit. If you run "mvn site" from trunk you can see the reports. An updated patch with this stuff cleaned up would save us some time.

          Ted Dunning added a comment -

          I just read the proposed code.

          The tests are very good when it comes to checking for boundary conditions and illegal input, but they do not test several functional aspects.

          In particular, it would be good to have tests for the following conditions:

          a) extreme quantiles such as the 99.999%-ile or the 0.00001%-ile

          b) highly skewed input distributions such as Gamma(0.1, 0.1), especially with low quantiles like 0.1%-ile

          This algorithm is essentially identical to the one that Mahout used to use. We dropped it in favor of t-digest because it was only able to compute one percentile per set of markers and because accuracy for skewed distributions was relatively marginal.

          The t-digest and Greenwald Khanna algorithms (let's call it GK01) both handle skewed distributions much more accurately than the Chen, Lambert, Pinheiro (let's call it CLP) algorithm. A particular virtue of GK01 and t-digest is that they allow any quantile to be computed without having to pre-specify which quantiles you want. It is common to need many quantiles from the same set of data and some applications require arbitrary quantiles. GK01 and t-digest also allow inverse CDF calculation (where you get to find out the quantile value for a particular x). GK01 is the algorithm used by Sawzall and (I think) Dremel. The t-digest algorithm has been adopted by Mahout, Streamlib and Elastic Search.

          The particular virtue of t-digest over GK01 is that it is several orders of magnitude more accurate for extreme quantiles. There are three well-tested implementations of t-digest available in both source and as a maven artifact. All the code is ASL 2 licensed.

          See https://github.com/tdunning/t-digest for more information.
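
          As a concrete illustration of (a) and (b), a test along these lines could compare the streaming estimate against the exact quantile of a heavily skewed distribution. This sketch is written against the PSquarePercentile class that was eventually committed for 3.4 (the patch class was still named PSquaredPercentile at this point), and the acceptable tolerance is exactly what such a test would have to establish.

            import org.apache.commons.math3.distribution.GammaDistribution;
            import org.apache.commons.math3.stat.descriptive.rank.PSquarePercentile;

            public class SkewedQuantileCheck {
                public static void main(String[] args) {
                    // Gamma(shape = 0.1, rate = 0.1), i.e. scale = 1/rate = 10: heavily skewed
                    GammaDistribution gamma = new GammaDistribution(0.1, 10.0);
                    gamma.reseedRandomGenerator(1234L);   // fixed seed, as suggested for the tests

                    double p = 99.999;                    // extreme quantile, in percent
                    PSquarePercentile estimator = new PSquarePercentile(p);
                    for (int i = 0; i < 5_000_000; i++) {
                        estimator.increment(gamma.sample());
                    }

                    double exact = gamma.inverseCumulativeProbability(p / 100.0);
                    double relativeError = Math.abs(estimator.getResult() - exact) / exact;
                    System.out.printf("estimate=%.4f exact=%.4f relErr=%.3f%n",
                            estimator.getResult(), exact, relativeError);
                }
            }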

          Venkatesha Murthy TS added a comment -

          Ted,
          I understand the point of view; however, just to clarify:
          a) While I have modeled this implementation on the existing Percentile class and as a StorelessUnivariateStatistic, I agree it is approximate for smaller sets but should improve for larger sets. I could add a comment that this is an approximate technique.
          b) GK, t-digest, Q-digest, BinMedian, BinApprox, etc.: I feel all of them can be implementation choices that the user could pick rather than sticking to any particular algorithm.

          Please let me know.

          Phil,
          I will also clear the CheckStyle and FindBugs issues and resubmit with tests added.

          Ted Dunning added a comment -

          Murthy,

          I think that you didn't get the point.

          All of these algorithms (except those retaining all of the data) are approximate. That isn't news, nor does it differentiate the algorithms.

          The questions are:

          a) is only a single quantile computed and is that quantile flexible?

          b) how accurate is the quantile calculation?

          c) how does the accuracy vary as a function of quantile, especially near 0 or 1?

          d) how long does it take to insert a new point?

          My assertion is that t-digest dominates all of these other options on all of these questions except possibly for (d) where it roughly matches them. I am not sure I see the point of implementing these other algorithms if they provide no advantage.

          To be specific:

          GK - is relatively fast and allows computation of multiple quantiles. Memory usage for a large number of quantiles is large, accuracy is less than t-digest controlling for memory usage. Accuracy at quantiles near 0 or 1 is several orders of magnitude worse. GK is not particularly susceptible to skewed distributions.

          Q-digest - only works on an integer domain which makes it not work well at all for skewed distributions. It has comparable accuracy as GK when quantization is not a factor. Memory usage is quite comparable to t-digest, but accuracy is noticeably worse and near q=0 or 1, accuracy is several orders of magnitude worse.

          BinMedian and BinApprox - only compute the median, and fail entirely for skewed distributions where the median is outside the range mean - sd to mean + sd. Accuracy is comparable to the other algorithms, but since no quantiles other than the median are computed they don't even compete.

          Ted Dunning added a comment -

          As a clarification, the median and the mean are never separated by more than a single standard deviation. The failure of binapprox on skewed data occurs because essentially all of the data can be in the range around the mean. A secondary failure is on sorted data.

          For example, for Gamma(0.1, 0.1), the mean is 1, the sd is 3.15 and the median is 0.006. About 93% of the distribution is in the range [mean - sd, mean + sd].
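
          Those numbers are easy to check with the distribution classes already in [math]; a quick sketch, expressing the rate-0.1 parameterization as scale = 1/rate = 10:

            import org.apache.commons.math3.distribution.GammaDistribution;

            public class GammaSkewCheck {
                public static void main(String[] args) {
                    // Gamma(shape = 0.1, rate = 0.1) expressed with scale = 1/rate = 10
                    GammaDistribution gamma = new GammaDistribution(0.1, 10.0);

                    double mean = gamma.getNumericalMean();                   // 1.0
                    double sd = Math.sqrt(gamma.getNumericalVariance());      // ~3.16
                    double median = gamma.inverseCumulativeProbability(0.5);  // ~0.006

                    // mass inside [mean - sd, mean + sd]; the lower bound is negative, so its CDF is 0
                    double inside = gamma.cumulativeProbability(mean + sd)
                                  - gamma.cumulativeProbability(mean - sd);
                    System.out.printf("mean=%.3f sd=%.3f median=%.5f inside=%.3f%n",
                            mean, sd, median, inside);
                }
            }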

          Venkatesha Murthy TS added a comment -

          Thanks for the details.
          So are there any plans to get this t-digest behind the Apache math3 interfaces, so that it can be substituted for, say, Percentile or used as a StorelessUnivariateStatistic?

          Ted Dunning added a comment -

          There are no plans as far as I know.

          You are welcome to jump on it if you like. I am happy to help with advice, but beyond what I have already done in deriving the algorithm and writing the t-digest reference implementation, I have to bow out of real coding for this.

          Venkatesha Murthy TS added a comment -

          Hi Phil,

          With Ted commenting on the accuracy near the extremes, and t-digest not yet being available behind the Math3 interfaces, does it still make sense if I resubmit the patch for the P-square algorithm with all the necessary changes?

          Please let me know.

          Also, if there are other problems in Commons that I can take a look at, please let me know the link for the same.

          Thanks for your help/advice and review.

          Regards
          Murthy

          Phil Steitz added a comment -

          We are a "do-ocracy" here. Those who actually do the work determine what happens to the code. Ted has a cool new algorithm and implementation that he is willing to grant, but he is not willing to do the work to get it ready to commit. You have implemented a standard algorithm that is close to ready to commit. Complete the cleanup and unless others have technical issues with the code, I will commit it.

          Ted Dunning added a comment -

          Venkat,

          Phil's comment is exactly correct. I am merely offering code you can use if you like. I am, sadly, not offering to do the work to integrate t-digest into commons math.

          This means that you are the one to choose how you would like to go forward.

          Venkatesha Murthy TS added a comment -

          Hi Phil

          Please find the revised patch named psquare-patch. In it I have tried to address the following:
          a) better code coverage (>93%) for my class
          b) addressed the PMD, FindBugs and CheckStyle report issues

          thanks
          murthy

          Venkatesha Murthy TS added a comment -

          Modified patch after incorporating review comments

          Venkatesha Murthy TS added a comment -

          Hi

          Keen to know if there is anything else I need to modify. Please help.

          thanks
          murthy

          Venkatesha Murthy TS added a comment -

          The attached psquare-patch is the latest, so please review that file. I shall remove the old patch if it causes confusion.

          Gilles added a comment -

          Thanks for your effort up to now. There are, however, some remarks about the coding style, to try and ensure consistency throughout the Commons Math codebase:

          • You should not use an underscore as a suffix or prefix (for method or variable names)
          • For constants, use the "final" keyword in the declaration
          • Do not overuse <p> in comments (especially in protected and private classes)
          • No initial uppercase letter after @return
          • Do not use protected fields (i.e. create setters instead)
          • Conform to the natural rules for capitalization in comments (e.g. no uppercase for the second word in a sentence)
          • Remove code that was commented-out (and indicated as unnecessary)
          • Do not start a @param or @throws with the word "is" (or "passed in")
          • Do not use a single uppercase letter as a variable name
          • Prefer (moderately) long names for instance fields and methods wherever it would make the code more self-documenting (cf. "ns", "nps", "qs")
          • Avoid consecutive uppercase letters (e.g. "processADataPoint" should rather be "processDataPoint").
          • (Personal preference) The first line of a Javadoc comment should not be broken after "@param x"; it is harder to read and needlessly increases the height of the text.
          • Indicate the purpose of "package" visibility when it is used instead of "private" or "protected".
          • Fully comment public fields, parameters, constructor and methods: see e.g. the constructor of "Marker" where the reader is referred to the documentation of a private field (which would not appear in the generated HTML)

          I can't comment much about the algorithmic side.
          A few points I noticed:

          1. The input (quantile) is allowed to be either in [0, 1] or in (1, 100]; this is error-prone and this kind of flexibility is not to be handled at the math level (IMHO). The input should be a primitive "double", not a "Number".
          2. At line 1206, the error message is abused, as it should refer to an index.
          3. At line 1211, use "NullArgumentException" in place of "MathIllegalArgumentException".
          4. At line 1214, do not combine different types of failures, and use "DimensionMismatchException" in place of "MathIllegalArgumentException".
          Venkatesha Murthy TS added a comment -

          Added the new patch incorporating all the comments. Where necessary, the noun forms have retained their capitalization.

          Venkatesha Murthy TS added a comment -

          Just wanted to add a few points here:

          First off, many thanks for suggesting a good set of comments; I have tried to incorporate almost all of them.

          The new patch to be used for review is 418-psquare-patch.

          a) The earlier reason for using variables such as n, np, etc. was to keep the names as close as possible to the variables used in the paper. However, I have already incorporated self-describing long names in the new patch.

          b) Is there a way to put these rules into CheckStyle or a similar tool to automate them? It is hard to remember to apply them manually every time. Please let me know if this is available somewhere so I can use it.

          c) Do I need to remove the earlier patch file, as having two patch files may confuse? Please let me know.

          thanks again
          venkat.

          Venkatesha Murthy TS added a comment -

          Hi All,

          Any updates on this, please?

          thanks
          venkat

          Gilles added a comment -

          There are still things to fix.

          • Usage of CM exceptions:

            throw new NullArgumentException(LocalizedFormats.NULL_NOT_ALLOWED, "estimator")

            must rather be

            throw new NullArgumentException();

            Indeed the message is the default one, and it takes no argument (hence the string "estimator" is useless).

          • You use "package" visibility in places where it must be "private" (e.g. "postConstruct").
          • I don't get why "Marker", "Markers", "PSquareInterpolatorEvaluator", ... are "protected". It is not helpful to indicate that it is for extensibility (as this is obvious); rather you should indicate what customization may make sense. Until an actual need arises (TBD on the ML), all the implementation details should be inside a "black-box" (to allow modifications without breaking compatibility).
          • IMHO, formatting should be taken care of in a separate class. The "toString" method should not do fancy formatting. (But this issue should probably be raised on the ML.)
          • No comments in all uppercase, please. That is akin to shouting.
          • What is the purpose of "toMap"?
          • Keyword "final" is missing in several places (see e.g. method "estimator(PSquareEstimator)", at line 1035).
          • Is it really necessary to override "equals" (see e.g. at line 1066)? Since you make strict comparisons of double values, it is unlikely that instances are ever going to be equal...
          • In Javadoc comment of public methods, do not use a @link to a "private" field or method.
          • Missing comment at lines 669 and 1368. Yes, everything must be documented.
          • Citing the paper more than once is redundant. The "href" link should appear in the top class's Javadoc; then it is sufficient (and more readable) to use a short-hand like "P-square".

          Do you run the "site" target of maven and make sure that the reports are clean? (Among other things, this runs "CheckStyle" and "FindBugs".)

          A general remark: when introducing a new functionality, you should first focus on the correctness of the implementation's main purpose, leaving out everything that is not strictly necessary (cf. formatting, provision for extensibility, ...). The longer the code, the longer it will take to review... Whereas once a solid base is committed, anyone can easily polish it, adding bells and whistles.

          Venkatesha Murthy TS added a comment -

          I have incorporated almost all comments. Please find the latest patch, by date, 30-may-2014-418-psquare-patch.

          Please suggest further changes; I am greatly indebted for the review comments and hope to have far fewer going forward.

          One question I have, though, concerns the main algorithm part, which is now in private static classes: I have provided a creation method and exposed the object through a protected interface, since I was forced to do this in order to get coverage and tests. Please tell me whether this is OK. With this, I have made sure not to impact the existing instance.

          Summary:
          -------------

          a) Modified to use appropriate exceptions such as NullArgumentException where required.
          b) Turned all inner classes into private static classes, while keeping a thin interface as protected. Even this will not impact the existing instance of PSquaredPercentile.
          c) I have cleared FindBugs, CheckStyle and PMD, and reached about 93% JaCoCo code coverage.

          Thanks a ton. Please let me know if I could have a quick chat with you, so that I can quickly understand the context and rationale of any further comments.

          Regards
          Venkat.

          Venkatesha Murthy TS added a comment -

          Sorry, I meant to ask the question as follows:

          The context is: I have modeled the algorithm as two nested static classes (i.e. Markers and Marker), as I couldn't possibly move all of that into one big main class (the main class is PSquarePercentile). Since the bulk of the logic needs to be tested and covered, I was forced to expose the logic as a protected inner interface (PSquareMarkers, which Markers implements). This interface is instantiated using a creation method (newMarkers) that returns an instance of Markers. Even PSquarePercentile makes use of newMarkers for instantiating Markers, while the PSquarePercentile class per se doesn't allow external setting of Markers. (See the rough sketch after this comment.)

          So my thought here has been to model Marker and Markers as separate and clear entities/classes (in fact private static, as they are really independent of the main class), yet allow sufficient hooks for testing (for instance, to cover more than 90%) by exposing them through one minimal interface (i.e. PSquareMarkers). Even so, external setting of markers into PSquarePercentile is not allowed, to keep up compatibility and safety. I am treating this as the black box, so there is no extensibility at the moment, except that clients can look at an interface. In fact, an earlier interface, PSquareEstimator, has been subsumed inside the Marker class, so essentially it is all in these two nested classes.

          I am most willing to consider solutions to my problem here. Please help if there are other simplifications that I can apply; I just thought of sharing what has been going on in my mind while doing this. Somewhere, though, I tend to think that if I merge all the nested classes into the main class, it is going to be one big class (akin to the God-class anti-pattern), which is where I am at a crossroads, unable to decide.

          Yes, I have run the mvn site target and taken care of PMD, CheckStyle, FindBugs and JaCoCo code coverage.

          I have also:
          a) removed the toMap() method,
          b) removed all @link references to private members in the javadoc,
          c) removed all-caps comments,
          d) removed the uppercase start of the second word, etc., in the javadoc,
          e) turned almost all non-interface methods private,
          f) turned all nested classes into private static classes.

          Gilles, Phil: please let me know if I could contact you over a messenger such as Yahoo!, etc.
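
          For readers following the design discussion, the shape being described is roughly the following. This is a simplified, hypothetical sketch; the names follow the patch, but the real classes contain the actual P-square update logic rather than these stubs.

            /** Simplified structural sketch of the nested-class design (illustrative only). */
            public class PSquareStructureSketch {

                /** Thin interface exposed so tests can drive the marker logic directly. */
                interface PSquareMarkers {
                    double processDataPoint(double observation);
                    double getPercentileValue();
                }

                /** Holds the five P-square markers; all algorithm state stays in here. */
                private static class Markers implements PSquareMarkers {
                    private final Marker[] markers;

                    Markers(Marker[] markers) {
                        this.markers = markers;
                    }

                    public double processDataPoint(double observation) {
                        // real implementation: adjust marker heights/positions per the P-square rules
                        return getPercentileValue();
                    }

                    public double getPercentileValue() {
                        return markers[2].height;  // the middle marker tracks the target quantile
                    }
                }

                /** One marker: its current height and position. */
                private static class Marker {
                    double height;
                    double position;
                }

                /** Creation method used both by the main class and by the tests. */
                static PSquareMarkers newMarkers(double quantile) {
                    // quantile would determine the markers' desired positions in the real class
                    Marker[] m = new Marker[5];
                    for (int i = 0; i < m.length; i++) {
                        m[i] = new Marker();  // real implementation seeds these from the first five values
                    }
                    return new Markers(m);
                }
            }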

          Phil Steitz added a comment -

          The last patch looks good to me, modulo a few typos in the javadoc that I can fix in the commit and converting the random data generation to use a fixed seed (we like to do that to avoid random test failures).

          I will wait a day or two to see if there is other feedback; otherwise will commit. Thanks for the patch!

          Venkatesha Murthy TS added a comment -

          Phil: Thanks for taking this patch this far. Please let me know if there is anything I can do or add to this.

          Venkatesha Murthy TS added a comment -

          Hi

          Would like to know if there are any further opinions on the patch.

          thanks
          venkatesh

          Venkatesha Murthy TS added a comment -

          Hi

          Any other comments? If everything else is fine, please let me know when it will be committed.

          thanks
          venkatesh

          Venkatesha Murthy TS added a comment -

          Hi

          Just checking if there is anything else I need to do here. If things look OK, please let me know when I could check back on the status of the commit.

          Thanks
          Venkat.

          Phil Steitz added a comment -

          Sorry for delay. Working on this now...

          Phil Steitz added a comment -

          30-may patch committed in r1604443 with the following modifications:

          • Changed class names to "PSquare" to match the references
          • Changed the default quantile to 50 to be consistent with Percentile
          • Changed some hashCode implementations to use the JDK Arrays.hashCode
          • Extracted constants to avoid repeated initialization (low/high marker indexes)
          • Changed random data tests to use a fixed seed
          • Miscellaneous javadoc edits

          Thanks for the patch!
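
          For the record, usage of the committed class is expected to look something like this (a sketch against the 3.4 API, consistent with the defaults and the quantile() behaviour discussed in this ticket):

            import org.apache.commons.math3.stat.descriptive.rank.PSquarePercentile;

            public class PSquarePercentileUsage {
                public static void main(String[] args) {
                    // default quantile is the median (50); pass another percentile in (0, 100] if needed
                    PSquarePercentile p95 = new PSquarePercentile(95);

                    java.util.Random rng = new java.util.Random(1234L);   // fixed seed, as in the tests
                    for (int i = 0; i < 100_000; i++) {
                        p95.increment(rng.nextGaussian());
                    }

                    System.out.println(p95.quantile());    // 0.95: the configured quantile, not the result
                    System.out.println(p95.getResult());   // streaming estimate of the 95th percentile (~1.645)
                }
            }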

          Venkatesha Murthy TS added a comment -

          Hi Phil,

          Thanks for committing.

          A small change, however, and I have attached the patch as of today.
          a) The public double quantile() method is a simple getter that just returns the quantile set during construction, so it is not the result. Basically it is the p/100 value, which is within [0.0, 1.0].

          b) Also, in the test method testDistribution() of PSquarePercentileTest.java, the fixed-seed random generator needs to be used here as well.

          Regards
          Venkatesh

          Phil Steitz added a comment -

          Thanks for the patch, Venkat! I applied the test patch, but I think the javadoc for quantile() is correct as is.

          Venkatesha Murthy TS added a comment -

          Hi Phil,

          Thanks for taking the patch.

          Somehow it initially seemed as though the getResult value was being stored in quantile and returned.

          Well, now that I have re-read the javadoc, it does convey that it is just a getter. I was just trying to make it a bit more explicit by saying that @return returns (or gets) the desired input quantile set during construction.
          I guess we can live without that explicit comment.


            People

            • Assignee: Unassigned
            • Reporter: Luc Maisonobe
            • Votes: 0
            • Watchers: 9
