Description
Apache pig provides in-built function 'COR' (correlation). COR is used to calculate the correlation between various variables.
COR function does not work if we provide any variable of datatype int or long. We need to explicitly cast the variables as double in the pig script. Which is never a good idea on the UI end.
I have tried to unit test the correlation function by supplying some int values and it fails to iterate the bag. Same is the case, when supplying some int,long and double variables as input parameters to the COR function. However, my unit test for doubles gives the correct output.
I have also tried to run the script on Hadoop Cluster, it fails if we have any variable other than double.
It shows the following error on Hadoop cluster:
ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2999: Unexpected internal error. null
or sometimes ERROR 1066: Unable to open iterator for alias aliasName. Backend error : null
In the Java Code of COR function, it casts everything to double, which is correct.But in the computeAll(-,-) function, the cast on iterators to yield x and y does creates a problem.
exact code :
double x =(Double)iterator_x.next().get(0); // error when int or long
double y =(Double)iterator_y.next().get(0); // error when int or long
Solutions: could be overriding the method getArgToFuncMapping() and defining Various classes IntCOR, LongCOR,FloatCOR. As it is done for some other UDFs like VAR.
Please, fix the issue in piggybank as well as in Built-in Library of Pig.
I am using Apache pig 0.11.0