Pig
  1. Pig
  2. PIG-2745

Pig e2e test RubyUDFs fails in MR mode when running from tarball

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.1
    • Fix Version/s: 0.11, 0.10.1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      To reproduce the issue, please run the e2e test "RubyUDFs_1" in MR mode from the tarball (not from installed Pig - please see why below). Either pseudo-distributed-mode or full-mode Hadoop can be used.

      ant -Dhadoopversion=23 -Dharness.old.pig=`pwd` -Dharness.cluster.conf=/etc/hadoop/conf/ -Dharness.cluster.bin=/usr/lib/hadoop/bin/hadoop test-e2e -Dtests.to.run="-t RubyUDFs_1"
      

      The test fails with the following error:

      java.lang.IllegalStateException: Could not initialize interpreter (from file system or classpath) with /home/cheolsoo/pig-0.10/test/e2e/pig/testdist/libexec/ruby/scriptingudfs.rb
      

      Looking at the job jar generated by Pig, "scriptingudfs.rb" can be found as follows:

      [cheolsoo@c1405 pig-cheolsoo]$ jar tvf bad.jar | grep scriptingudfs.rb
        2491 Fri Jun 08 15:52:08 PDT 2012 /home/cheolsoo/pig-0.10/test/e2e/pig/testdist/libexec/ruby/scriptingudfs.rb
      

      Looking at getScriptAsStream() method in ScriptEngine.java, "scriptingudfs.rb" is supposed to be read from the job jar, but it is not. The reason is because getResourceAsStream("/x") looks for "x" (without the leading "/") not "/x". Since "scriptingudfs.rb" is stored with it absolute path, it ends up being not found by getResourceAsStream(scriptPath).

      File file = new File(scriptPath);
      if (file.exists()) {
          try {
              is = new FileInputStream(file);
          } catch (FileNotFoundException e) {
              throw new IllegalStateException("could not find existing file "+scriptPath, e);
          }
      } else {
          if (file.isAbsolute()) {
              is = ScriptEngine.class.getResourceAsStream(scriptPath);
          } else {
              is = ScriptEngine.class.getResourceAsStream("/" + scriptPath);
          }
      }
      

      In fact, the test passes if you run in local mode or from installed Pig. The reason is because "scriptingudfs.rb" is found in local file system (e.g /usr/share/pig/test/e2e/pig/udfs/ruby/scriptingudfs.rb).

      The fix seems straightforward. Attached is the patch that removes the leading "/" when registering UDF scripts so that they are stored without the leading "/" in the job jar as follows:

      [cheolsoo@c1405 pig-cheolsoo]$ jar tvf good.jar | grep scriptingudfs.rb
        2491 Fri Jun 08 15:52:08 PDT 2012 home/cheolsoo/pig-0.10/test/e2e/pig/testdist/libexec/ruby/scriptingudfs.rb
      

      Thanks!

      1. enable_scripting_tests_23.patch
        5 kB
        Daniel Dai
      2. Test001.java
        2 kB
        Daniel Dai
      3. PIG-2745-2.patch
        0.7 kB
        Cheolsoo Park
      4. PIG-2745.patch
        0.7 kB
        Cheolsoo Park

        Issue Links

          Activity

          Daniel Dai made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Rohini Palaniswamy made changes -
          Link This issue relates to PIG-2761 [ PIG-2761 ]
          Hide
          Rohini Palaniswamy added a comment -

          Cheolsoo,
          I am looking into the issue Daniel mentioned. Should have a fix soon.

          Show
          Rohini Palaniswamy added a comment - Cheolsoo, I am looking into the issue Daniel mentioned. Should have a fix soon.
          Cheolsoo Park made changes -
          Link This issue relates to PIG-2623 [ PIG-2623 ]
          Hide
          Cheolsoo Park added a comment -

          Thanks for your explanation, Daniel.

          I will open a new jira and look into that unless someone's already doing.

          Btw, there is another jira PIG-2760 related to the UDF script path. That seems a valid bug as well.

          Show
          Cheolsoo Park added a comment - Thanks for your explanation, Daniel. I will open a new jira and look into that unless someone's already doing. Btw, there is another jira PIG-2760 related to the UDF script path. That seems a valid bug as well.
          Cheolsoo Park made changes -
          Link This issue relates to PIG-2760 [ PIG-2760 ]
          Daniel Dai made changes -
          Attachment enable_scripting_tests_23.patch [ 12532590 ]
          Daniel Dai made changes -
          Attachment enable_scripting_tests_23.patch [ 12532588 ]
          Daniel Dai made changes -
          Attachment enable_scripting_tests_23.patch [ 12532588 ]
          Hide
          Daniel Dai added a comment -

          Hi, Cheolsoo,
          You are right. This issue is fixed as a byproduct of PIG-2623, which convert the relative path to absolute path. All the Scripting tests pass for hadoop 23 now. I will enable those tests for 23.

          However, there is one another hole left. If we import another python module, Pig cannot pack/refer the path of dependent python module correctly. Here is one example:

          udf.py:
          from base import square

          @outputSchemaFunction("squaresquareSchema")
          def squaresquare(num):
          if num == None:
          return None
          return (square(num)*square(num))

          @schemaFunction("squaresquareSchema")
          def squaresquareSchema(input):
          return input

          base.py
          def square(num):
          if num == None:
          return None
          return ((num)*(num))

          Pig script:
          register 'udf.py' using jython as myfuncs;

          a = load '1.txt' as (a0:int);
          b = foreach a generate myfuncs.squaresquare(a0);
          dump b;

          Pig incorrectly pack the base.py as /base.py in job.jar, and fail to refer it in backend. It happens in both 20 and 23.

          Show
          Daniel Dai added a comment - Hi, Cheolsoo, You are right. This issue is fixed as a byproduct of PIG-2623 , which convert the relative path to absolute path. All the Scripting tests pass for hadoop 23 now. I will enable those tests for 23. However, there is one another hole left. If we import another python module, Pig cannot pack/refer the path of dependent python module correctly. Here is one example: udf.py: from base import square @outputSchemaFunction("squaresquareSchema") def squaresquare(num): if num == None: return None return (square(num)*square(num)) @schemaFunction("squaresquareSchema") def squaresquareSchema(input): return input base.py def square(num): if num == None: return None return ((num)*(num)) Pig script: register 'udf.py' using jython as myfuncs; a = load '1.txt' as (a0:int); b = foreach a generate myfuncs.squaresquare(a0); dump b; Pig incorrectly pack the base.py as /base.py in job.jar, and fail to refer it in backend. It happens in both 20 and 23.
          Hide
          Cheolsoo Park added a comment -

          Hi Daniel, thanks for submitting my patch!

          I am wondering why you think that the issue with relative paths for Python still exists. In my YARN cluster, the Scripting_* tests (excluded due to MAPREDUCE-3700) all pass. (Technically, I am using Hadoop-2.0.0, but that shouldn't make a difference.) I can also manually verify that it works in Grunt shell.

          My fix shouldn't be Ruby-specific since the problem is with PigServer stuffing any UDF scripts into the job jar.

          Looking at your test code, one thing that I haven't thought about is "../" although that shouldn't be an issue now as in the registerCode() method, relative paths are always converted to absolute paths by FileLocalizer.fetchFile(). Nevertheless, handling "../" as well might be a good idea to make that method more robust.

          Thanks!

          Show
          Cheolsoo Park added a comment - Hi Daniel, thanks for submitting my patch! I am wondering why you think that the issue with relative paths for Python still exists. In my YARN cluster, the Scripting_* tests (excluded due to MAPREDUCE-3700 ) all pass. (Technically, I am using Hadoop-2.0.0, but that shouldn't make a difference.) I can also manually verify that it works in Grunt shell. My fix shouldn't be Ruby-specific since the problem is with PigServer stuffing any UDF scripts into the job jar. Looking at your test code, one thing that I haven't thought about is "../" although that shouldn't be an issue now as in the registerCode() method, relative paths are always converted to absolute paths by FileLocalizer.fetchFile(). Nevertheless, handling "../" as well might be a good idea to make that method more robust. Thanks!
          Daniel Dai made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags Reviewed [ 10343 ]
          Assignee Cheolsoo Park [ cheolsoo ]
          Fix Version/s 0.11 [ 12318878 ]
          Fix Version/s 0.10.1 [ 12320547 ]
          Resolution Fixed [ 1 ]
          Hide
          Daniel Dai added a comment -

          Patch committed to 0.10/trunk. Thanks Cheolsoo!

          Show
          Daniel Dai added a comment - Patch committed to 0.10/trunk. Thanks Cheolsoo!
          Daniel Dai made changes -
          Attachment Test001.java [ 12532358 ]
          Hide
          Daniel Dai added a comment -

          Looks good. I also attach a java code to demonstrate the problem. The patch fix the issue in Ruby. The issue for Python relative path is still there.

          Show
          Daniel Dai added a comment - Looks good. I also attach a java code to demonstrate the problem. The patch fix the issue in Ruby. The issue for Python relative path is still there.
          Hide
          Rohini Palaniswamy added a comment -

          +1. Tested this patch with relative path and absolute path for 20.205 and 23. Works fine.

          Daniel,
          Can you include this in 0.10 also. Without this scripting udfs do not work in 23 for both relative and absolute path. e2e tests currently marked ignored for MAPREDUCE-3700 also will have to be enabled again.

          Show
          Rohini Palaniswamy added a comment - +1. Tested this patch with relative path and absolute path for 20.205 and 23. Works fine. Daniel, Can you include this in 0.10 also. Without this scripting udfs do not work in 23 for both relative and absolute path. e2e tests currently marked ignored for MAPREDUCE-3700 also will have to be enabled again.
          Cheolsoo Park made changes -
          Attachment PIG-2745-2.patch [ 12532228 ]
          Hide
          Cheolsoo Park added a comment -

          I updated the patch to handle not only a leading "/" but also "./".

          In fact, it is not necessary to worry about "./" since FileLocalizer.fetchFile() already converts relative paths to absolute paths; nevertheless, it seems like a good idea to make this method more robust anyway.

          In addition, I replaced substring(1) with

          replaceFirst("^\\./|^/", "")

          because the latter seems more intuitive.

          Show
          Cheolsoo Park added a comment - I updated the patch to handle not only a leading "/" but also "./". In fact, it is not necessary to worry about "./" since FileLocalizer.fetchFile() already converts relative paths to absolute paths; nevertheless, it seems like a good idea to make this method more robust anyway. In addition, I replaced substring(1) with replaceFirst( "^\\./|^/" , "") because the latter seems more intuitive.
          Alan Gates made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Cheolsoo Park made changes -
          Description To reproduce the issue, please run the e2e test "RubyUDFs_1" in MR mode from the tarball (not from installed Pig - please see why below). Either pseudo-distributed-mode or full-mode Hadoop can be used.

          {code}
          ant -Dhadoopversion=23 -Dharness.old.pig=`pwd` -Dharness.cluster.conf=/etc/hadoop/conf/ -Dharness.cluster.bin=/usr/lib/hadoop/bin/hadoop test-e2e -Dtests.to.run="-t RubyUDFs_1"
          {code}

          The test fails with the following error:

          {code}
          java.lang.IllegalStateException: Could not initialize interpreter (from file system or classpath) with /home/cheolsoo/pig-0.10/test/e2e/pig/testdist/libexec/ruby/scriptingudfs.rb
          {code}

          Now look at the job jar generated by Pig, and search for "scriptingudfs.rb" that the error complains about.

          To save the job jar in /tmp, I had to comment out the following line in JobComtrolCompiler.java:

          {code}
          submitJarFile.deleteOnExit();
          {code}

          It can be seen that the absolute path of the script is stored in the job jar as follows:

          {code}
          [cheolsoo@c1405 pig-cheolsoo]$ jar tvf bad.jar | grep scriptingudfs.rb
            2491 Fri Jun 08 15:52:08 PDT 2012 /home/cheolsoo/pig-0.10/test/e2e/pig/testdist/libexec/ruby/scriptingudfs.rb
          {code}

          Looking at getScriptAsStream() method in ScriptEngine.java, "scriptingudfs.rb" seems supposed to be able to be found from the jar, but it is not. The reason is because getResourceAsStream("/x") looks for "x" (without the leading "/") not "/x" in the jar. Since "scriptingudfs.rb" is stored as the absolute path with the leading "/", it ends up being not found by getResourceAsStream(scriptPath).

          {code}
          File file = new File(scriptPath);
          if (file.exists()) {
              try {
                  is = new FileInputStream(file);
              } catch (FileNotFoundException e) {
                  throw new IllegalStateException("could not find existing file "+scriptPath, e);
              }
          } else {
              if (file.isAbsolute()) {
                  is = ScriptEngine.class.getResourceAsStream(scriptPath);
              } else {
                  is = ScriptEngine.class.getResourceAsStream("/" + scriptPath);
              }
          }
          {code}

          In fact, the test appears to pass if you run in local mode or from installed Pig. The reason is because "scriptingudfs.rb" exists in local file system (e.g /usr/share/pig/test/e2e/pig/udfs/ruby/scriptingudfs.rb), so it is found in file system.

          The fix in UNIX seems straightforward. When registering UDF scripts, we can simply remove the leading "/". For example,

          {code:title=src/org/apache/pig/PigServer.java}
          - pigContext.addScriptFile(f.getPath());
          + String key = f.isAbsolute() ? f.getPath().substring(1) : f.getPath();
          + pigContext.addScriptFile(key, f.getPath());
          {code}

          This results in that the UDF scripts are stored without the leading "/" in the job jar as follows:

          {code}
          [cheolsoo@c1405 pig-cheolsoo]$ jar tvf good.jar | grep scriptingudfs.rb
            2491 Fri Jun 08 15:52:08 PDT 2012 home/cheolsoo/pig-0.10/test/e2e/pig/testdist/libexec/ruby/scriptingudfs.rb
          {code}

          But this won't work with Windows and S3 as their root dir is not "/".

          Alternatively, we could store the UDF scripts with the file name instead of the full absolute path in the job jar. But this will disallow more than one UDF scripts with the same name but in different paths to be registered.

          I am wondering if anyone has a better suggestion. Thanks!
          To reproduce the issue, please run the e2e test "RubyUDFs_1" in MR mode from the tarball (not from installed Pig - please see why below). Either pseudo-distributed-mode or full-mode Hadoop can be used.

          {code}
          ant -Dhadoopversion=23 -Dharness.old.pig=`pwd` -Dharness.cluster.conf=/etc/hadoop/conf/ -Dharness.cluster.bin=/usr/lib/hadoop/bin/hadoop test-e2e -Dtests.to.run="-t RubyUDFs_1"
          {code}

          The test fails with the following error:

          {code}
          java.lang.IllegalStateException: Could not initialize interpreter (from file system or classpath) with /home/cheolsoo/pig-0.10/test/e2e/pig/testdist/libexec/ruby/scriptingudfs.rb
          {code}

          Looking at the job jar generated by Pig, "scriptingudfs.rb" can be found as follows:

          {code}
          [cheolsoo@c1405 pig-cheolsoo]$ jar tvf bad.jar | grep scriptingudfs.rb
            2491 Fri Jun 08 15:52:08 PDT 2012 /home/cheolsoo/pig-0.10/test/e2e/pig/testdist/libexec/ruby/scriptingudfs.rb
          {code}

          Looking at getScriptAsStream() method in ScriptEngine.java, "scriptingudfs.rb" is supposed to be read from the job jar, but it is not. The reason is because getResourceAsStream("/x") looks for "x" (without the leading "/") not "/x". Since "scriptingudfs.rb" is stored with it absolute path, it ends up being not found by getResourceAsStream(scriptPath).

          {code}
          File file = new File(scriptPath);
          if (file.exists()) {
              try {
                  is = new FileInputStream(file);
              } catch (FileNotFoundException e) {
                  throw new IllegalStateException("could not find existing file "+scriptPath, e);
              }
          } else {
              if (file.isAbsolute()) {
                  is = ScriptEngine.class.getResourceAsStream(scriptPath);
              } else {
                  is = ScriptEngine.class.getResourceAsStream("/" + scriptPath);
              }
          }
          {code}

          In fact, the test passes if you run in local mode or from installed Pig. The reason is because "scriptingudfs.rb" is found in local file system (e.g /usr/share/pig/test/e2e/pig/udfs/ruby/scriptingudfs.rb).

          The fix seems straightforward. Attached is the patch that removes the leading "/" when registering UDF scripts so that they are stored without the leading "/" in the job jar as follows:

          {code}
          [cheolsoo@c1405 pig-cheolsoo]$ jar tvf good.jar | grep scriptingudfs.rb
            2491 Fri Jun 08 15:52:08 PDT 2012 home/cheolsoo/pig-0.10/test/e2e/pig/testdist/libexec/ruby/scriptingudfs.rb
          {code}

          Thanks!
          Cheolsoo Park made changes -
          Field Original Value New Value
          Attachment PIG-2745.patch [ 12531620 ]
          Hide
          Cheolsoo Park added a comment -

          I also see the same issue with e2e Scripting tests where Jython UDF scripts are not found in classpath. Applying the change that I described let those test pass as well.

          Show
          Cheolsoo Park added a comment - I also see the same issue with e2e Scripting tests where Jython UDF scripts are not found in classpath. Applying the change that I described let those test pass as well.
          Cheolsoo Park created issue -

            People

            • Assignee:
              Cheolsoo Park
              Reporter:
              Cheolsoo Park
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development