Details

Type: New Feature

Status: Closed

Priority: Minor

Resolution: Fixed

Affects Version/s: 2.1

Fix Version/s: 2.1

Labels: None
Description
I've added semivariance calculations to my local build of commons-math and I would like to contribute them.
Semivariance is described briefly at http://en.wikipedia.org/wiki/Semivariance , but a real reason to use it is in finance, to compute the Sortino ratio rather than the Sharpe ratio.
http://en.wikipedia.org/wiki/Sortino_ratio gives an explanation of the Sortino ratio and why you would choose to use that rather than the Sharpe ratio. (There are other ways to measure the performance of your portfolio, but I won't bore everybody with that stuff.)
I've already completed the coding along with the test cases, and it builds using mvn site.
The only two files I've modified are src/main/java/org/apache/commons/math/stat/StatUtils.java and src/test/java/org/apache/commons/math/stat/StatUtilsTest.java

 patch.txt
 12 kB
 Larry Diamond

 patch2.txt
 12 kB
 Larry Diamond

 SemiVariance.java
 12 kB
 Larry Diamond

 SemiVariance.java
 10 kB
 Larry Diamond

 SemiVariance.java
 5 kB
 Larry Diamond

 SemiVarianceTest.java
 3 kB
 Larry Diamond

 SemiVarianceTest.java
 2 kB
 Larry Diamond

 SemiVarianceTest.java
 2 kB
 Larry Diamond

 StatUtils.java
 33 kB
 Larry Diamond

 StatUtils.java
 33 kB
 Larry Diamond

 StatUtilsTest.java
 16 kB
 Larry Diamond

 StatUtilsTest.java
 16 kB
 Larry Diamond
Activity
And here's the patch file, as per the Math developers' contribution guidelines.
I have just one comment on this proposal, but beware: I have almost no knowledge of statistics.
In your implementation, you seem to use 0 as the mean. Shouldn't the mean be computed beforehand to know where to put the cutoff value?
I'll let the real statisticians give their opinion on this feature.
Looks like a good addition to me. Will complete review and commit (possibly with some mods) this weekend assuming others are OK with the addition.
Oh my this is embarrassing. You're right Luc. I messed up the algo when copying it from my source.
When computing the Sortino ratio, you're eliminating the losses, so the code is correct for the Sortino ratio but is not truly semivariance.
For anything below the mean, you should replace the value with the mean of the original distribution. I had zero-adjusted the distribution before I called the method.
I'll do a rewrite on this code and reattach it to this issue. Thank you for reviewing the code! I am so embarrassed!
Here's an update where the zero adjustment isn't presumed elsewhere.
I've also added more test cases.
Please feel free to contact me if you have any questions about this submission.
Thank you very much for your time and attention.
First, thanks for the patch. I am having a hard time validating the computing formula, based on the description in the references. The formula provided in the Wikipedia reference appears to be different from what the patch is computing. That is OK if what we want to compute is something different. We can provide the formula directly in the javadoc. But if there is a standard formula, we should try to implement something equivalent.
If the intent is to estimate the Sortino ratio, it would seem to me that the values above the mean should be excluded, rather than collapsed to the mean. If I understand correctly what the code is trying to compute, these values should be excluded, not recoded. The Sortino ratio also would seem to require another parameter: the "target" return value. I guess this is why in your application you shift the mean to zero?
Thanks again for the patch and sorry for the delay reviewing it.
Two comments on the code in the latest patch:
1) the mean() function (actually the Mean statistic) will perform the array bounds check, so if you move that outside the if statement, you can eliminate the test function.
2) If this is a widely used statistic and/or the computation becomes any more involved than the patch, we should implement a UnivariateStatistic class to represent it.
Hi Phil. No worries, it's the holiday season and it's an all-volunteer project. I find you and Luc to be very responsive, all things considered.
I actually find the process of contributing to the project very straightforward. It's intimidating until you try, but you both have been very easy to work with.
Short answer: we'll be talking about a UnivariateStatistic really soon.
Long answer: addressing your questions (in no particular order)...
This link will help somewhat: http://thecuriousinvestor.com/2007/10/03/sortinoratio/
1. Some of the tests will occur twice in some of the functions. I copied the four methods for variance to create the semivariance methods. I'm having second thoughts about that, since I really would only call the one that takes the array or possibly the array and the mean. I've never needed to compute semivariance for a part of an array.
I'm very flexible on this. I went for consistency rather than the best performance possible, so feel free to make changes as you see fit.
2. The Sortino ratio uses the downward standard deviation, which is the square root of the semivariance. You definitely would want to keep the values above the mean and collapse them rather than exclude them altogether.
A simple explanation of the Sortino Ratio is probably in order to explain why.
The Sortino Ratio comes from the Sharpe Ratio. The Sharpe Ratio is used to rate how much reward you're getting for the risk you're taking. Standard Deviation is the divisor. Higher is better.
One criticism of the Sharpe Ratio is that returns in excess of the mean increase your standard deviation, but you don't mind getting big rewards from time to time, so those periods shouldn't count.
An example helps here
Here's your monthly returns.
-1
0
1
2
2
20
That 20% return is nice (I'd like a 20% monthly return too!) but that 20% makes your standard deviation higher and your Sharpe Ratio lower.
But nobody minds an occasional blowout return. That doesn't increase the risk of the fund in the view of people who prefer the Sortino Ratio.
Mean = 4, Variance = 52.333, Std Dev = 7.234
But for the semivariance calculation, that 20 becomes a 4 and stays in the set, so semivariance becomes 9.666 and the downward standard deviation becomes 3.109. Dropping out the 20% isn't appropriate because it's part of your return.
That's a big difference in how your performance looks: 7.234 / 3.109 = 2.326. Your fund now looks 2.3 times better than it did before!
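The arithmetic in this example can be checked with a small standalone sketch. It assumes the first monthly return is -1 (the stated mean of 4 requires the six returns to sum to 24) and that the divisor is n, which is what reproduces the quoted figures. The class and method names are invented for illustration and are not part of Commons Math.

```java
import java.util.Locale;

final class SortinoArithmetic {

    /** Arithmetic mean of the values. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Mean squared deviation from the given center, dividing by n. */
    static double meanSquaredDeviation(double[] values, double center) {
        double sumSq = 0.0;
        for (double v : values) {
            double dev = v - center;
            sumSq += dev * dev;
        }
        return sumSq / values.length;
    }

    /** Copy of the values with everything above the cap replaced by the cap. */
    static double[] collapseAbove(double[] values, double cap) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = Math.min(values[i], cap);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed data: the first return is taken to be -1 so that the mean is 4.
        double[] returns = {-1, 0, 1, 2, 2, 20};
        double m = mean(returns);
        double variance = meanSquaredDeviation(returns, m);
        // The 20 is collapsed to the mean and stays in the set.
        double semivariance = meanSquaredDeviation(collapseAbove(returns, m), m);
        System.out.println(String.format(Locale.ROOT,
                "mean=%.3f variance=%.3f stdDev=%.3f", m, variance, Math.sqrt(variance)));
        System.out.println(String.format(Locale.ROOT,
                "semivariance=%.3f downsideStdDev=%.3f", semivariance, Math.sqrt(semivariance)));
    }
}
```

Up to rounding in the last digit, the printed values agree with the figures quoted in the example.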
That "other" parameter is often called the "minimum acceptable return". Some people prefer to look at only when you lost money or would fail to meet some regular performance metric. Pension managers get estimates on what they need to make that year and will look for people who have the best chance of making that return.
Instead of looking at the mean, they have a minimum return that they want to measure off of.
So, from the example above, let's say that the pension manager has a minimum acceptable return of 2%.
In that case, the 2% and 20% returns don't add into the semivariance, so you get a semivariance of 8.333.
This is where that zero initially came from.
Okay, now here's the little curveball that initially sounds bad but really turns into something easy:
Some years, you lose money. The mean is below zero. In those cases, that minimum acceptable return is "sometimes" the minimum of the mean and zero. Rather than putting in code that automatically checks the mean and replaces the minimum acceptable return, if we have code that takes in the double array and the MAR, I think we're fine.
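A method that takes the array and the MAR could look roughly like the sketch below. It is not the submitted patch: the names are hypothetical, and the choice to measure deviations from the mean while using the MAR only as a filter is an assumption inferred from the 8.333 figure (with returns -1, 0, 1, 2, 2, 20, a mean of 4 and a divisor of n). The second data set is made up to show the "minimum of the mean and zero" rule being applied by the caller.

```java
final class MarSemivariance {

    /** Arithmetic mean of the values. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /**
     * Sum of squared deviations from the mean, taken only over returns
     * strictly below the minimum acceptable return, divided by n.
     */
    static double semivariance(double[] returns, double mar) {
        double m = mean(returns);
        double sumSq = 0.0;
        for (double r : returns) {
            if (r < mar) {
                double dev = r - m;
                sumSq += dev * dev;
            }
        }
        return sumSq / returns.length;
    }

    public static void main(String[] args) {
        // Assumed data from the example, with -1 as the first return.
        double[] returns = {-1, 0, 1, 2, 2, 20};
        System.out.println(semivariance(returns, 2.0));

        // Made-up losing year: the caller decides to use min(mean, 0) as the MAR.
        double[] losingYear = {-3, -1, 0, 1, -2, -1};
        double mar = Math.min(mean(losingYear), 0.0);
        System.out.println(semivariance(losingYear, mar));
    }
}
```

Keeping the MAR choice in the caller means the method itself never has to inspect the sign of the mean.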
Okay, now that I've made a really long JIRA comment, what's the next step?
Should I rewrite as a UnivariateStatistic? Should I just make the semivariance code two methods: one that takes the array and one that takes the array and a MAR? Or has this entry been so long I've made you want to hit the eggnog?
Thanks, Larry. I am happy that you are not finding it too hard to get started contributing. We appreciate and welcome your contributions!
Now to the eggnog...er, I mean issue at hand....
I now (think I) understand what you are trying to compute and get why you leave the top-coded entries in place. What now looks funny to me is to recode and then just compute ordinary variance. That will not give you E(X - MAR)^2, but rather E(X - E(recoded X))^2. I think you may need to directly compute the squared deviations from the MAR (or the mean with the top-coded entries contributing 0) instead of computing the variance on the recoded data. That seems to be what your second reference above is describing. Consider the influence of the original values greater than or equal to the mean in the result computed below:
    for (int loop = 0; loop < values.length; loop++) {
        if (values[loop] < mean) {
            semivariancevalues[loop] = values[loop];
        } else {
            semivariancevalues[loop] = mean;
        }
    }
    return VARIANCE.evaluate(semivariancevalues, mean);
The top-coded values will not contribute 0, but will instead contribute whatever their deviation is above the mean of the recoded dataset. Is this what you really want? It would seem to me that the more natural measure would be E(X - original mean)^2.
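The distinction can be made concrete with a standalone sketch that contrasts the two candidate quantities on the example returns (assumed here to be -1, 0, 1, 2, 2, 20). It does not call the Commons Math Variance class; the helper names are hypothetical and the divisor is n for simplicity.

```java
final class RecodingComparison {

    /** Arithmetic mean of the values. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Mean squared deviation from the given center, dividing by n. */
    static double meanSquaredDeviation(double[] values, double center) {
        double sumSq = 0.0;
        for (double v : values) {
            double dev = v - center;
            sumSq += dev * dev;
        }
        return sumSq / values.length;
    }

    /** Top-codes the data: values at or above the cap are replaced by the cap. */
    static double[] topCode(double[] values, double cap) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i] < cap ? values[i] : cap;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] values = {-1, 0, 1, 2, 2, 20}; // assumed example data
        double originalMean = mean(values);
        double[] recoded = topCode(values, originalMean);

        // Ordinary variance of the recoded data: deviations from the recoded mean.
        double aroundRecodedMean = meanSquaredDeviation(recoded, mean(recoded));
        // Deviations from the original mean: top-coded entries contribute 0.
        double aroundOriginalMean = meanSquaredDeviation(recoded, originalMean);

        System.out.println("around recoded mean:  " + aroundRecodedMean);
        System.out.println("around original mean: " + aroundOriginalMean);
    }
}
```

On this data the first quantity is much smaller than the second, because recoding pulls the mean down and the top-coded entries then sit above it.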
Sorry to ask so many questions. Could well be I am just misunderstanding what the statistic is trying to estimate. I just want to make sure we are computing something that we can easily describe and more importantly what is really useful.
Regarding the UnivariateStatistic, I think we should go ahead and do that and include the target as an optional constructor argument.
The definition provided here: http://www.jstor.org/pss/2330500 makes sense to me as a "semivariance" measure. An unbiased estimator would be the sum of the squared deviations from the target of the values below the target divided by the total number of values minus one.
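Written out, the estimator described in words above would be as follows, with t the target, n the total number of values, and the sum running only over values below the target. The symbols are chosen here for illustration and are not taken from the referenced paper.

```latex
\hat{\sigma}^{2}_{\text{lower}}(t) \;=\; \frac{1}{n-1} \sum_{i \,:\, x_i < t} (x_i - t)^{2}
```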
I'm working on a SemiVariance.java class and SemiVarianceTest.java class for the second try on this.
It's a bit different working on code for public reuse than it is working on code for reuse within the firm or for your own use.
As regards the biased vs. unbiased question, I feel that Variance does it right: start off bias-corrected and allow the caller to change it if that's appropriate for their use.
There's also an upside standard deviation which is the same thing except you accept only the data elements above the cutoff value.
I should have it ready tomorrow.
It might make sense to introduce another boolean constructor parameter to indicate upper or lower semivariance. Assuming that lower semivariance is more common, lower could be the default. Alternatively, this can always just be computed by subtracting the lower value from the unconditioned variance, so it might not be necessary. Could always be added later.
How do the files I'm about to attach look? I believe this is the right direction. I still need to hook this into the class hierarchy, write up the documentation, and possibly put something into the StatUtils class.
I went with an independent class that's a peer of Variance.
This code does require Java 5 to build; I hope Math is building that way.
Please find attached the new class for SemiVariance and the testing class. I'll get going on the remaining items I listed. Can you please take a look and confirm this is the right direction?
Thank you very much!
Larry Diamond
(PS: Happy New Year if we don't email before then!)
Happy almost New Year to you, Larry!
Great progress here. I like the less smelly approach to defining upper and lower, rather than using a boolean. I do have a couple of comments:
- This class should extend either AbstractStorelessUnivariateStatistic or AbstractUnivariateStatistic. The former is for stats that do not require the full array of values to be provided and stored. I am honestly not sure at this point whether we are going to be able to define increment() methods for this statistic, so it could be that AbstractUnivariateStatistic is a better parent. Once you do this, you can replace your argument checking with calls to the test() method defined in AbstractUnivariateStatistic. Have a look at how Variance does this.
- The cutoff and direction should be optional constructor parameters. The UnivariateStatistic interface requires that statistics have evaluate methods that require no parameters beyond the input array and subarray indices. To implement this interface while supporting the other config options, you need to supply the config parameters in constructor arguments.
- I am still struggling a little with the definition. This is why the "documentation goes here" bits would really help. I was assuming that when you use a cutoff value in place of the mean, you compute deviations from the cutoff rather than from the mean. That would correspond more neatly to the variance decomposition that I thought this statistic was supposed to measure. If I have this right, you don't need to compute the mean when a cutoff value is provided and deviations should be computed against the cutoff.
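The points above can be sketched as a standalone class in which the cutoff and direction are supplied through constructors and the evaluate methods take only the array and subarray indices. This is a hypothetical illustration, not the attached class: it does not extend AbstractUnivariateStatistic, the names are invented, returning NaN for fewer than two values is an assumption, and the divisor is n - 1.

```java
final class SemiVarianceSketch {

    enum Direction { UPSIDE, DOWNSIDE }

    private final boolean useMean;   // true when no cutoff was configured
    private final double cutoff;     // ignored when useMean is true
    private final Direction direction;

    /** Downside semivariance measured against the mean of the evaluated values. */
    SemiVarianceSketch() {
        this(Direction.DOWNSIDE);
    }

    SemiVarianceSketch(Direction direction) {
        this.useMean = true;
        this.cutoff = Double.NaN;
        this.direction = direction;
    }

    SemiVarianceSketch(double cutoff, Direction direction) {
        this.useMean = false;
        this.cutoff = cutoff;
        this.direction = direction;
    }

    double evaluate(double[] values) {
        return evaluate(values, 0, values.length);
    }

    double evaluate(double[] values, int begin, int length) {
        if (values == null || begin < 0 || length < 0 || begin + length > values.length) {
            throw new IllegalArgumentException("invalid array or subarray indices");
        }
        if (length < 2) {
            return Double.NaN; // assumption: undefined for fewer than two values
        }
        // The mean is only computed when no cutoff was supplied.
        double target = useMean ? mean(values, begin, length) : cutoff;
        double sumSq = 0.0;
        for (int i = begin; i < begin + length; i++) {
            double dev = values[i] - target;
            boolean counted = (direction == Direction.DOWNSIDE) ? dev < 0 : dev > 0;
            if (counted) {
                sumSq += dev * dev; // only values on the chosen side contribute
            }
        }
        return sumSq / (length - 1);
    }

    private static double mean(double[] values, int begin, int length) {
        double sum = 0.0;
        for (int i = begin; i < begin + length; i++) {
            sum += values[i];
        }
        return sum / length;
    }
}
```

When the mean is the cutoff, the downside and upside results add up to the ordinary bias-corrected variance, which is the subtraction shortcut mentioned earlier in the thread.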
Thanks. I'll hopefully get to this over the weekend. Those annoying people who pay me a salary have actually been consuming my time recently and I haven't gotten done what I hoped.
Clearly, step one is those "Documentation Goes Here" bits. Getting those in there would clarify your third point, and I see that as **KEY** to publicly reusable software.
I agree with not using AbstractStorelessUnivariateStatistic. I have to read over AbstractUnivariateStatistic to make sure I can use it. I'd like to extend something in the package.
I'll look at the UnivariateStatistic to see the constructors. I can see that you'd want to calculate Upward and Downward on the same data (and really the full variance too), although yeah, once you have one you automatically have the other (full - downward = upward).
Clearly, those documentation bits would make what I'm talking about much clearer to all readers of this post.
Thanks for the note on the not using boolean there. There was just no way that I could have people try to remember which direction was true and which direction was false. That's just silly and cruel.
Hi again.
The class is now documented; hopefully I have made the concept easier to understand.
Most of the work on this class has been documentation. Most of the code changes are to ensure the class extends AbstractUnivariateStatistic and fills in any missing methods that AbstractUnivariateStatistic requires.
I've broken every last one of my test cases and I need to rework them from scratch now, but I think I'm at a point now where I can post the work for review.
And now the tests are complete!
I'm glad I did all this. The code itself was not difficult; explaining the concept, documenting the code, and all the "stuff" around it were much more extensive than what I've worked with in the past.
Thank you very much. I hope my code makes it into the next release. It certainly was interesting, fun, and "cool" to do from my perspective.
Larry Diamond
Added more documentation for the class and some more test cases.
Hopefully, these additions will make the contribution easier to understand.
Apologies for the response latency on the latest patch. Almost there.
I still think we should be computing deviations from the cutoff, rather than the mean when a cutoff is provided. The version of evaluate that takes a cutoff value should not require the mean as a parameter. We can also improve efficiency in evaluate(...h...) by computing the deviation and incrementing the SS only for values on the correct side of the cutoff.
We need to add references and formulas to the javadoc. I can take care of that as long as we agree that we are using the mathematical formula here for lower (aka downside) semivariance: http://www.jstor.org/pss/2330500. That formula says that the lower semivariance is the expected squared deviation of a value below the cutoff from the cutoff.
This can wait, but we should also see if we can get better numerics on the SS computation by using a two-pass algorithm as we do in Variance.
No worries. I'm on the apache mailing lists, and I can see you've been busy.
Sure, I'm on board. How can I help make it happen?
Last patch committed with the following changes in r910264:
- per comments above, used deviations from cutoff rather than the mean when a cutoff is provided
- improved efficiency of evaluate loop
- added convenience evaluate methods (different sets of parameters)
- conform to Commons Math coding standards (CheckStyle)
I did look into improved numerics similar to Variance on the sum of squares computation, but without assumptions on cutoff, I do not see a way to improve accuracy in the sum.
Thanks for the patch!
Thank you!
This was a great experience for me and I appreciate your time and effort in making this happen.
I was more than happy to contribute and will probably do it again (I've done other quantitative work that I'd like to contribute).
I have a better idea on how to format the contribution from the initial proposal so it can be included quicker.
Thank you for your time and effort. This was a good learning experience for me.
Please find attached the modified StatUtils.java and StatUtilsTest.java classes for semivariance.