Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-594

Inconsistent behaviour of FilterFunc UDF when used in the Filter and ForEach statements

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 0.2.0
    • None
    • impl
    • None

    Description

      I have a UDF known as INSETFROMFILE, which matches data against a set of values stored in an HDFS file. The INSETFROMFILE extends FilterFunc. Here is a sample pig script which uses it.

      register util.jar;
      define InQuerySet util.INSETFROMFILE('/user/viraj/insetfilterfile');
      A = load '/user/viraj/myurldata.txt' using PigStorage() as (url, bcookie);
      B = group A by (url);
      C = foreach B generate ((InQuerySet(A.bcookie))?1:0) as inset, A;
      dump C;
      

      This script fails with the following exception in the reducer:
      ================================================================================================================
      java.lang.NullPointerException
      at org.apache.pig.backend.hadoop.datastorage.ConfigurationUtil.toProperties(ConfigurationUtil.java:45)
      at util.INSETFROMFILE.init(INSETFROMFILE.java:79)
      at util.INSETFROMFILE.exec(INSETFROMFILE.java:99)
      at util.INSETFROMFILE.exec(INSETFROMFILE.java:61)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:185)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:223)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.getNext(POBinCond.java:92)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:259)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:197)
      at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:280)
      at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:247)
      at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:224)
      at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:136)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
      at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
      ================================================================================================================
      To avoid this error we use the INSETFROMFILE UDF in the Filter statement of Pig and it works.

      register util.jar;
      define InQuerySet util.INSETFROMFILE('/user/viraj/insetfilterfile');
      A = load '/user/viraj/myurldata.txt' using PigStorage() as (url, bcookie);
      B = filter A by InQuerySet(bcookie);
      dump B;
      

      The result is:
      (www.yahoo.com,12344)

      Problems:
      1) Why does the FilterFunc UDF, INSETFROMFILE show inconsistent behaviour when used in the FOREACH?
      2) Is there a rule that FilterFunc UDF should be used in Filter statement?
      3) Properties props = ConfigurationUtil.toProperties(PigInputFormat.sJob) is null when the FilterFunc UDF is called within ForEach

      Attaching data and script file for testing.

      Attachments

        1. INSETFROMFILE.java
          2 kB
          Viraj Bhat
        2. insetfilterfile
          0.0 kB
          Viraj Bhat
        3. myurldata.txt
          0.1 kB
          Viraj Bhat

        Activity

          People

            Unassigned Unassigned
            viraj Viraj Bhat
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated: