Details

    • Type: New Feature New Feature
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.5.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      UDF, written in Pig 0.5 contrib/

    • Tags:
      contrib udf variance standard deviation

      Description

      I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm

      Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig.

      1. PIG-1150.patch
        9 kB
        Jenny Thompson
      2. var.patch
        11 kB
        rjurney

        Activity

        Hide
        Olga Natkovich added a comment -

        Yes, it is definitely worse while to contribute!

        Show
        Olga Natkovich added a comment - Yes, it is definitely worse while to contribute!
        Hide
        Olga Natkovich added a comment -

        Updating the fix version since it will go into the future version and will not be backported

        Show
        Olga Natkovich added a comment - Updating the fix version since it will go into the future version and will not be backported
        Hide
        rjurney added a comment -

        This patch will not cut the mustard - it lacks Javadoc, and test cases and its just plain ugly.

        That being said, people requested this on twitter, so I'm pushing this one for people to use if they want to. Will get a passable patch up later this week.

        Show
        rjurney added a comment - This patch will not cut the mustard - it lacks Javadoc, and test cases and its just plain ugly. That being said, people requested this on twitter, so I'm pushing this one for people to use if they want to. Will get a passable patch up later this week.
        Hide
        rjurney added a comment -

        Oh - one other thing - I've read that this naive parallel method of calculating variance can have precision problems - all those double's getting subtracted from one another and then squared. I've thought of using BigDecimal, which can handle arbitrary precision numbers. My understanding is that this would be slow, but that it would probably still be IO bound.

        Is that something people would like to see? I could maybe make another UDF that uses BigDecimal or something. I've never actually encountered the precision problems in practice, but I can see how that might be a big problem for some people.

        Show
        rjurney added a comment - Oh - one other thing - I've read that this naive parallel method of calculating variance can have precision problems - all those double's getting subtracted from one another and then squared. I've thought of using BigDecimal, which can handle arbitrary precision numbers. My understanding is that this would be slow, but that it would probably still be IO bound. Is that something people would like to see? I could maybe make another UDF that uses BigDecimal or something. I've never actually encountered the precision problems in practice, but I can see how that might be a big problem for some people.
        Hide
        Alan Gates added a comment -

        We've had some requests to add a fixed point type for issues where precision problems aren't acceptable. Another option for you would be to add the variance now in double and when a fixed point type is added add a fix point version of variance.

        Show
        Alan Gates added a comment - We've had some requests to add a fixed point type for issues where precision problems aren't acceptable. Another option for you would be to add the variance now in double and when a fixed point type is added add a fix point version of variance.
        Hide
        Tamir Kamara added a comment -

        This can be very useful for me so I tested your patch but got weird results. I believe that the problem is at combine method - it treats the tuple as if it contains the original values but to my understanding it should work with the intermediate output and do something like this:

        static protected Tuple combine(DataBag values) throws ExecException {
        	double sum = 0;
        	long count = 0;
        	double sumOfSquares = 0;
        
        	Tuple output = mTupleFactory.newTuple(3);
        
        	for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
        		Tuple t = it.next();
        
        		sum += (Double) t.get(0);
        		count += (Long) t.get(1);
        		sumOfSquares += (Double) t.get(2);
        		
        	}
        
        	output.set(0, sum);
        	output.set(1, count);
        	output.set(2, sumOfSquares);
        
        	return output;
        }
        
        Show
        Tamir Kamara added a comment - This can be very useful for me so I tested your patch but got weird results. I believe that the problem is at combine method - it treats the tuple as if it contains the original values but to my understanding it should work with the intermediate output and do something like this: static protected Tuple combine(DataBag values) throws ExecException { double sum = 0; long count = 0; double sumOfSquares = 0; Tuple output = mTupleFactory.newTuple(3); for (Iterator<Tuple> it = values.iterator(); it.hasNext();) { Tuple t = it.next(); sum += ( Double ) t.get(0); count += ( Long ) t.get(1); sumOfSquares += ( Double ) t.get(2); } output.set(0, sum); output.set(1, count); output.set(2, sumOfSquares); return output; }
        Hide
        Olga Natkovich added a comment -

        We would like to cut release branch next Monday. This mean that the code needs to be committed by the end of the week. Is this likely to happen?

        If not, I would like to unlink this from 0.7.0 release and leave for inclusion in one of the future releases when this patch is ready.

        Show
        Olga Natkovich added a comment - We would like to cut release branch next Monday. This mean that the code needs to be committed by the end of the week. Is this likely to happen? If not, I would like to unlink this from 0.7.0 release and leave for inclusion in one of the future releases when this patch is ready.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Florian Leibert and I have a better implementation that we'd like to post. We will try to get it done before Monday.

        Show
        Dmitriy V. Ryaboy added a comment - Florian Leibert and I have a better implementation that we'd like to post. We will try to get it done before Monday.
        Hide
        rjurney added a comment -

        Yes, this sounds like the thing to do

        On Tue, Mar 16, 2010 at 5:29 PM, Dmitriy V. Ryaboy (JIRA)

        Show
        rjurney added a comment - Yes, this sounds like the thing to do On Tue, Mar 16, 2010 at 5:29 PM, Dmitriy V. Ryaboy (JIRA)
        Hide
        Dmitriy V. Ryaboy added a comment -

        Changed the target to 0.8 – we won't have time to finish splitting this out by Monday.

        Show
        Dmitriy V. Ryaboy added a comment - Changed the target to 0.8 – we won't have time to finish splitting this out by Monday.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Jay, there may be – I only glanced at the code here. The real problem is this: http://planetmath.org/encyclopedia/OnePassAlgorithmToComputeSampleVariance.html – you are going to get round-off errors, and possibly overflow errors, using this approach.

        Thanks for reminding me that I promised this, I'll work on open-sourcing our code.

        -Dmitriy

        Show
        Dmitriy V. Ryaboy added a comment - Jay, there may be – I only glanced at the code here. The real problem is this: http://planetmath.org/encyclopedia/OnePassAlgorithmToComputeSampleVariance.html – you are going to get round-off errors, and possibly overflow errors, using this approach. Thanks for reminding me that I promised this, I'll work on open-sourcing our code. -Dmitriy
        Hide
        hc busy added a comment -

        similarly, there's some code here on numerically stable and distributed calculation: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

        I mean, while we're at it, why not calculate all central moments?

        centralMoments(x, y)
        

        returns central moments of x up to y

        centralMoments(x,3)
        

        will return a tuple containing

        (mean, variance, skew)

        Show
        hc busy added a comment - similarly, there's some code here on numerically stable and distributed calculation: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance I mean, while we're at it, why not calculate all central moments? centralMoments(x, y) returns central moments of x up to y centralMoments(x,3) will return a tuple containing (mean, variance, skew)
        Hide
        Russell Jurney added a comment -

        What do y'all think about making this a builtin based on the Chan method? It seems like a lot of the UDFs can be improved and made into built-ins. ROUND for instance, seems like it should be included in the language.

        ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

        Show
        Russell Jurney added a comment - What do y'all think about making this a builtin based on the Chan method? It seems like a lot of the UDFs can be improved and made into built-ins. ROUND for instance, seems like it should be included in the language. ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf
        Hide
        Alan Gates added a comment -

        I filed PIG-1405 to track standard function that need to be added to PIG. It has VAR and ROUND.

        Show
        Alan Gates added a comment - I filed PIG-1405 to track standard function that need to be added to PIG. It has VAR and ROUND.
        Hide
        Olga Natkovich added a comment -

        Dmitry, is patch ready to be committed or are you planning to submit a new one? Thanks

        Show
        Olga Natkovich added a comment - Dmitry, is patch ready to be committed or are you planning to submit a new one? Thanks
        Hide
        Dmitriy V. Ryaboy added a comment -

        Meh. Go ahead and commit. Don't put it into builtin, since it has math problems at scale. Ok for piggybank.

        Show
        Dmitriy V. Ryaboy added a comment - Meh. Go ahead and commit. Don't put it into builtin, since it has math problems at scale. Ok for piggybank.
        Hide
        Olga Natkovich added a comment -

        Dmitry, the patch is missing unit tests. Once, you add them, I will commit it.

        Show
        Olga Natkovich added a comment - Dmitry, the patch is missing unit tests. Once, you add them, I will commit it.
        Hide
        Olga Natkovich added a comment -

        Dmitry, are you planning to add unit tests? Do we still want this in for 0.8? (Since it is going into piggybank, we can do this post branching but then we need to test in 2 places.)

        Show
        Olga Natkovich added a comment - Dmitry, are you planning to add unit tests? Do we still want this in for 0.8? (Since it is going into piggybank, we can do this post branching but then we need to test in 2 places.)
        Hide
        Dmitriy V. Ryaboy added a comment -

        I won't have time before the 30th.

        BTW one doesn't even need a udf if using the sum of squares approach.. just generate the square and the sum in the foreach (it will perform the algebraic decomposition automatically)

        Show
        Dmitriy V. Ryaboy added a comment - I won't have time before the 30th. BTW one doesn't even need a udf if using the sum of squares approach.. just generate the square and the sum in the foreach (it will perform the algebraic decomposition automatically)
        Hide
        Olga Natkovich added a comment -

        So should we unlink this from the release?

        Show
        Olga Natkovich added a comment - So should we unlink this from the release?
        Hide
        Dmitriy V. Ryaboy added a comment -

        Yeah I think it's not a big deal if we are splitting piggybank out soon anyway.

        Show
        Dmitriy V. Ryaboy added a comment - Yeah I think it's not a big deal if we are splitting piggybank out soon anyway.
        Hide
        Olga Natkovich added a comment -

        Unlinking from the release because there is not much activity on the ticket and we want to move piggybank out soon

        Show
        Olga Natkovich added a comment - Unlinking from the release because there is not much activity on the ticket and we want to move piggybank out soon
        Hide
        Jayadev Chandrasekhar added a comment -

        [From Yahoo! Labs]
        Just checking whether variance (or std deviation) UDF got finally implemented in pig, from this ticket it appears not, but not sure whether a similar item is tracked somewhere else. I will need to either use it or implement it if not present, can try to contribute my implementation if it is not too difficult to make it generic, please point me to some guidelines while contributing.

        Show
        Jayadev Chandrasekhar added a comment - [From Yahoo! Labs] Just checking whether variance (or std deviation) UDF got finally implemented in pig, from this ticket it appears not, but not sure whether a similar item is tracked somewhere else. I will need to either use it or implement it if not present, can try to contribute my implementation if it is not too difficult to make it generic, please point me to some guidelines while contributing.
        Hide
        Jayadev Chandrasekhar added a comment -

        Sorry was just checking the whole chain of comments , it appears that piggybank is not maintained by the pig team any more ?

        Show
        Jayadev Chandrasekhar added a comment - Sorry was just checking the whole chain of comments , it appears that piggybank is not maintained by the pig team any more ?
        Hide
        Daniel Dai added a comment -

        We are still maintaining piggybank and we will review and commit code once there's a patch. The var patch attached has not been checked in. It needs rework mentioned by Dmitriy.

        Show
        Daniel Dai added a comment - We are still maintaining piggybank and we will review and commit code once there's a patch. The var patch attached has not been checked in. It needs rework mentioned by Dmitriy.
        Hide
        Dmitriy V. Ryaboy added a comment -

        An accumulative version has been checked into the LinkedIn DataFu library (a contrib from Cloudera). You can use that.
        It's still not as efficient as one would like, I might have some news on that front once internal testing at Twitter is done .

        Show
        Dmitriy V. Ryaboy added a comment - An accumulative version has been checked into the LinkedIn DataFu library (a contrib from Cloudera). You can use that. It's still not as efficient as one would like, I might have some news on that front once internal testing at Twitter is done .
        Hide
        Russell Jurney added a comment -

        I can't find variance in DataFu. I'm going to fix this in Piggybank, one day.

        Show
        Russell Jurney added a comment - I can't find variance in DataFu. I'm going to fix this in Piggybank, one day.
        Hide
        Russell Jurney added a comment -

        Does this look right? http://introcs.cs.princeton.edu/java/97data/OnePass.java.html

        It looks right to me. I think?

        Show
        Russell Jurney added a comment - Does this look right? http://introcs.cs.princeton.edu/java/97data/OnePass.java.html It looks right to me. I think?
        Hide
        Jenny Thompson added a comment -

        I have attached an UDF that uses the Chan method. I used AVG as a guide, so I implemented Algebraic and Accumulator. If this is to be committed, I can add javadocs and test cases.

        Show
        Jenny Thompson added a comment - I have attached an UDF that uses the Chan method. I used AVG as a guide, so I implemented Algebraic and Accumulator. If this is to be committed, I can add javadocs and test cases.

          People

          • Assignee:
            Dmitriy V. Ryaboy
            Reporter:
            Russell Jurney
          • Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

            • Created:
              Updated:

              Development