Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
0.2.0
-
None
-
None
Description
I have a UDF known as INSETFROMFILE, which matches data against a set of values stored in an HDFS file. The INSETFROMFILE extends FilterFunc. Here is a sample pig script which uses it.
register util.jar; define InQuerySet util.INSETFROMFILE('/user/viraj/insetfilterfile'); A = load '/user/viraj/myurldata.txt' using PigStorage() as (url, bcookie); B = group A by (url); C = foreach B generate ((InQuerySet(A.bcookie))?1:0) as inset, A; dump C;
This script fails with the following exception in the reducer:
================================================================================================================
java.lang.NullPointerException
at org.apache.pig.backend.hadoop.datastorage.ConfigurationUtil.toProperties(ConfigurationUtil.java:45)
at util.INSETFROMFILE.init(INSETFROMFILE.java:79)
at util.INSETFROMFILE.exec(INSETFROMFILE.java:99)
at util.INSETFROMFILE.exec(INSETFROMFILE.java:61)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:185)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:223)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.getNext(POBinCond.java:92)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:259)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:197)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:280)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:247)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:224)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:136)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
================================================================================================================
To avoid this error we use the INSETFROMFILE UDF in the Filter statement of Pig and it works.
register util.jar; define InQuerySet util.INSETFROMFILE('/user/viraj/insetfilterfile'); A = load '/user/viraj/myurldata.txt' using PigStorage() as (url, bcookie); B = filter A by InQuerySet(bcookie); dump B;
The result is:
(www.yahoo.com,12344)
Problems:
1) Why does the FilterFunc UDF, INSETFROMFILE show inconsistent behaviour when used in the FOREACH?
2) Is there a rule that FilterFunc UDF should be used in Filter statement?
3) Properties props = ConfigurationUtil.toProperties(PigInputFormat.sJob) is null when the FilterFunc UDF is called within ForEach
Attaching data and script file for testing.