Pig / PIG-277

UDF for computing correlation and covariance between data sets

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

      Description

      UDFs for computing correlation and covariance between data sets. Use the following commands to compute covariance:
      A = load 'input.xml' using PigStorage(':');
      B = group A all;
      define c COV('a','b','c');
      D = foreach B generate group,c(A.$0,A.$1,A.$2);
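For readers unfamiliar with the statistic, the value requested for each pair of columns can be sketched in plain Java. This is only an illustration, not the code from the attached patches: the class and method names (`CovarianceSketch`, `covariance`) are hypothetical, and it assumes the population form cov(X, Y) = E[XY] - E[X]E[Y], which the issue itself does not specify.

```java
// Illustrative sketch only; names are hypothetical and not taken from the patch.
final class CovarianceSketch {

    /** Population covariance of two equally long columns: E[XY] - E[X]E[Y]. */
    static double covariance(double[] x, double[] y) {
        if (x.length == 0 || x.length != y.length) {
            throw new IllegalArgumentException("columns must be non-empty and of equal length");
        }
        // Three running sums are enough to compute the result in one pass.
        double sumX = 0.0;
        double sumY = 0.0;
        double sumXY = 0.0;
        for (int i = 0; i < x.length; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
        }
        double n = x.length;
        return sumXY / n - (sumX / n) * (sumY / n);
    }
}
```

Calling `CovarianceSketch.covariance(a, b)` on two columns of equal length returns a single double. Running sums like these can be added together across partial results, which is one common reason for a function of this kind to have a combine step.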

      1. stat.patch (27 kB, Ajay Garg)
      2. newStats.patch (30 kB, Ajay Garg)

        Activity

        Ajay Garg added a comment -

        Patch attached...

        Pi Song added a comment -

        Good work

        • Please be a bit more careful with code formatting
        • Please convert tabs to spaces (We use 1 tab = 4 spaces)

        Covariance

        • COV.combine: What does this do?
          Tuple tuple = new Tuple(Integer.valueOf(values.size()+"").intValue());
        • This looks a bit ugly:
          catch(RuntimeException t) {
              throw new RuntimeException(t.getMessage() + ": " + input, t);
          }

        Correlation
        int totalSchemas = Double.valueOf(((1+Math.sqrt(1+4*combined.arity()))/2)).intValue();
        I think we may have problems with this line. Javadoc says .intValue() will truncate the fractional part.
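One possible reading of the concern is that a truncated result is accepted without any check. For example, an arity of 7 gives (1 + sqrt(29)) / 2, roughly 3.19, which truncates to 3 even though 3 * 2 is not 7. A minimal sketch of a guarded version follows. The class and method names are hypothetical, and it assumes the arity is expected to equal x(x-1) for the number of schemas x, as the quoted formula implies.

```java
// Illustrative sketch only; not the code from the patch.
final class SchemaCountCheckSketch {

    /**
     * Solves x * (x - 1) = arity with the closed form, rounding to the
     * nearest integer and then verifying the candidate instead of trusting it.
     */
    static int countFromArity(int arity) {
        long x = Math.round((1 + Math.sqrt(1 + 4.0 * arity)) / 2);
        if (x * (x - 1) != arity) {
            throw new IllegalArgumentException(
                    "arity " + arity + " is not of the form x * (x - 1)");
        }
        return (int) x;
    }
}
```

Rounding plus an explicit check means an unexpected arity fails loudly rather than producing a plausible but wrong count.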

        Olga Natkovich added a comment -

        A couple of additional comments:

        • Some of the functions throw RuntimeException. As is the case with the math functions, we should only throw IOExceptions.
        • It would be nice to include in the comments a link to a page that defines the functions.
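The first point could be handled along these lines. This is a hedged sketch, not the change made in the patch: `ExceptionWrapSketch` and `parseField` are hypothetical names, and the field conversion is only a stand-in for whatever work the real functions do.

```java
import java.io.IOException;

// Illustrative sketch only; names are hypothetical and not taken from the patch.
final class ExceptionWrapSketch {

    /** Converts a field to a double, reporting any failure as an IOException. */
    static double parseField(Object field) throws IOException {
        try {
            return Double.parseDouble(String.valueOf(field));
        } catch (RuntimeException e) {
            // Keep the original exception as the cause so the stack trace survives.
            throw new IOException("Cannot convert field to double: " + field, e);
        }
    }
}
```

Callers then only need to handle IOException, and the underlying failure is still available through `getCause()`.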
        Pi Song added a comment -

        Since we're not going to compute correlation between very many data sets, I think we could just solve

        x*x - x - n = 0 => x(x-1) = n

        by substituting x = 1, 2, 3, ... until we get a match or x(x-1) > n.
        We could do it as a binary search to get O(log n) if you like, but we won't get much out of it.
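The substitution approach might look like the sketch below. The names `SchemaCountSearchSketch` and `solve` are hypothetical, and returning -1 when no integer solution exists is an assumption made for illustration; the comment does not say how a mismatch should be reported.

```java
// Illustrative sketch only; not the code from the patch.
final class SchemaCountSearchSketch {

    /**
     * Returns the positive integer x with x * (x - 1) == n,
     * or -1 if no such x exists.
     */
    static int solve(int n) {
        if (n < 0) {
            return -1;
        }
        // Try x = 1, 2, 3, ... and stop as soon as x * (x - 1) exceeds n.
        // long arithmetic avoids overflow of the product for large n.
        for (long x = 1; x * (x - 1) <= n; x++) {
            if (x * (x - 1) == n) {
                return (int) x;
            }
        }
        return -1;
    }
}
```

The loop runs about sqrt(n) times, which is negligible for the small number of data sets discussed here, and it avoids floating-point arithmetic entirely.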

        Ajay Garg added a comment -

        New patch (newStats.patch) attached with the suggested modifications.

        Pi Song added a comment -

        Looks good. +1

        If there are no objections within 48 hours, I will commit this.

        Olga Natkovich added a comment -

        I have committed the changes. Thanks Ajay!


          People

          • Assignee: Ajay Garg
          • Reporter: Ajay Garg
          • Votes: 0
          • Watchers: 0
