Pig / PIG-738

Regexp passed from pigscript fails in UDF


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0
    • Fix Version/s: 0.6.0
    • Component/s: grunt
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Consider a Pig script that counts the matches of a regular expression in a text file.
      The regular expression supplied in the Pig script needs to escape the "." (dot) character so that it matches only a literal dot.

      register myregexp.jar;
      
      -- pattern not picked up
      
      define minelogs ci_pig_udfs.RegexGroupCount('www\\.yahoo\\.com/sports');
      
      A = load '/user/viraj/regexpinput.txt'  using PigStorage() as (source : chararray);
      
      B = foreach A generate minelogs(source) as sportslogs;
      
      dump B;
      
      

      Snippet of UDF RegexGroupCount.java

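      package ci_pig_udfs;

      import java.io.IOException;
      import java.util.regex.Pattern;

      import org.apache.pig.EvalFunc;
      import org.apache.pig.data.Tuple;

      public class RegexGroupCount extends EvalFunc<Integer> {

          private final Pattern pattern_;

          public RegexGroupCount(String patternStr) {
              // Debug output: print the constructor argument exactly as received from the Pig script.
              System.out.println("My pattern supplied is " + patternStr);
              // Compare with the intended regex; in Java source, "\\." is one backslash followed by a dot.
              System.out.println("Equality test " + patternStr.equals("www\\.yahoo\\.com/sports"));
              pattern_ = Pattern.compile(patternStr, Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
          }

          public Integer exec(Tuple input) throws IOException {
              // Body omitted in the report.
          }
      }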
      

      Running the above script on the following dataset:
      ====================================================================================================
      dshfdskfwww.yahoo.com/sportsjoadfjdslpdshfdskfwww.yahoo.com/sportsjoadfjdsl
      kas;dka;sd
      jsjsjwww.yahoo.com/sports
      jsdLSJDcom/sports
      wwwJyahooMcom/sports
      ====================================================================================================

      Results in the following:

      My pattern supplied is www\\.yahoo
      .com/sports
      Equality test false
      My pattern supplied is www\\.yahoo
      .com/sports
      Equality test false
      My pattern supplied is www\\.yahoo
      .com/sports
      Equality test false
      My pattern supplied is www\\.yahoo
      .com/sports
      Equality test false
      My pattern supplied is www\\.yahoo
      .com/sports
      Equality test false
      My pattern supplied is www\\.yahoo
      .com/sports
      Equality test false
      Userfunc: (Name: UserFunc viraj-Sat Mar 28 02:06:31 PDT 2009-14 function: ci_pig_udfs.RegexGroupCount('www\\.yahoo
      .com/sports') Operator Key: viraj-Sat Mar 28 02:06:31 PDT 2009-14)
      Userfunc fs: int
      My pattern supplied is www\\.yahoo
      .com/sports
      Equality test false
      My pattern supplied is www\\.yahoo
      .com/sports
      Equality test false
      My pattern supplied is www\\.yahoo
      .com/sports
      Equality test false

      2009-03-28 02:06:43,923 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
      2009-03-28 02:06:43,923 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
      (0)
      (0)
      (0)
      (0)
      (0)
      ====================================================================================================
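
      For comparison with the all-zero output above, the following is a minimal standalone sketch of the counts one would expect if the pattern reached the UDF intact. The report omits the body of exec, so the class name RegexCountSketch and the helper countMatches are hypothetical, and the sketch assumes the UDF counts non-overlapping matches per input line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCountSketch {

    // Hypothetical helper (not from the report): counts non-overlapping matches in one line.
    public static int countMatches(Pattern pattern, String line) {
        Matcher matcher = pattern.matcher(line);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // In Java source, "\\." is the two-character regex \. which matches a literal dot.
        Pattern escaped = Pattern.compile("www\\.yahoo\\.com/sports",
                Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

        // The five input lines from the report.
        String[] lines = {
            "dshfdskfwww.yahoo.com/sportsjoadfjdslpdshfdskfwww.yahoo.com/sportsjoadfjdsl",
            "kas;dka;sd",
            "jsjsjwww.yahoo.com/sports",
            "jsdLSJDcom/sports",
            "wwwJyahooMcom/sports"
        };

        // Prints 2, 0, 1, 0, 0 (one count per line).
        for (String line : lines) {
            System.out.println(countMatches(escaped, line));
        }
    }
}
```

      The last input line, wwwJyahooMcom/sports, is the reason the dots must be escaped: with an unescaped pattern such as www.yahoo.com/sports, each "." would match any character and that line would be counted as well.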

      In essence, there seems to be no way to pass this kind of constructor argument (a regular expression containing escaped characters) through the Pig script. The only workaround seems to be hard-coding the pattern in the UDF.
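
      The logged output can be read against plain Java string semantics. The sketch below is an illustration only: the class EscapeLevelsSketch and the helper found are hypothetical, and the asLogged value is an assumption reconstructed from the printed log (two backslashes before the first dot, a line break before the second). The report does not state what Pig actually passed or why.

```java
import java.util.regex.Pattern;

public class EscapeLevelsSketch {

    // Hypothetical helper: true if the regex matches anywhere in the text,
    // using the same flags as the UDF in the report.
    public static boolean found(String regex, String text) {
        return Pattern.compile(regex, Pattern.DOTALL | Pattern.CASE_INSENSITIVE)
                .matcher(text)
                .find();
    }

    public static void main(String[] args) {
        String text = "jsjsjwww.yahoo.com/sports";

        // The Java literal from the UDF's equality test: at run time each "\\" is ONE backslash.
        String javaLiteral = "www\\.yahoo\\.com/sports";

        // The text of the Pig script taken character for character: TWO backslashes per dot.
        String verbatim = "www\\\\.yahoo\\\\.com/sports";

        // Assumption: an approximation of what the log shows the UDF received.
        String asLogged = "www\\\\.yahoo\n.com/sports";

        System.out.println(javaLiteral.equals(verbatim)); // false
        System.out.println(javaLiteral.equals(asLogged)); // false

        System.out.println(found(javaLiteral, text)); // true: \. matches a literal dot
        System.out.println(found(verbatim, text));    // false: \\ demands a literal backslash
        System.out.println(found(asLogged, text));    // false: also demands a literal backslash
    }
}
```

      Under these assumptions, both the failed equality test and the zero counts follow: any pattern string that still contains a doubled backslash requires a literal backslash in the input, which none of the data lines contain.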

      Attachments

        1. regexpinput.txt (0.2 kB, Viraj Bhat)
        2. regexp.pig (0.3 kB, Viraj Bhat)
        3. RegexGroupCount.java (1 kB, Viraj Bhat)
        4. PIG-738.patch (7 kB, Pradeep Kamath)
        5. myregexp.jar (2 kB, Viraj Bhat)

          People

            Assignee: Pradeep Kamath (pkamath)
            Reporter: Viraj Bhat (viraj)