Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-3668

COR built-in function when atleast one of the coefficient values is NaN

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Patch Available
    • Major
    • Resolution: Unresolved
    • 0.12.0, 0.11.1, 0.12.1
    • None
    • internal-udfs
    • None
    • Patch Available

    Description

      When passing multiple column keys for Correlation analysis, if coefficient value of one of the combinations is NaN, then the value for all other combinations is not computed.

      Pearson Co-efficient value is NaN if all values for a given column are the same.

      Example:
      A = LOAD 'myData' USING org.apache.hcatalog.pig.HCatLoader();
      B = group A all;
      c = foreach B generate group, FLATTEN(COR((bag

      {tuple(double)}) A.col_1,(bag{tuple(double)}

      ) A.col_2, (bag

      {tuple(double)}) A.col_3, (bag{tuple(double)}

      ) A.col_4));

      If the value of pearson coefficient for col_1 and col_2 is NaN, then value of co-efficients for all combinations is NaN

      This is happening because of 'return null' statement in catch block on lines 157 and 235 in file org.apache.pig.builtin.COR.java
      If the catch block is removed, then the correlation analysis would continue for the remaining columns. (ApachePig 0.12.0)

      Attachments

        1. CORR.diff
          1.0 kB
          Hiten Java

        Activity

          People

            hitenjava Hiten Java
            hitenjava Hiten Java
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: