Pig / PIG-2745

Pig e2e test RubyUDFs fails in MR mode when running from tarball

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.1
    • Fix Version/s: 0.11, 0.10.1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      To reproduce the issue, run the e2e test "RubyUDFs_1" in MR mode from the tarball (not from an installed Pig; the reason is explained below). Either pseudo-distributed or fully distributed Hadoop can be used.

      ant -Dhadoopversion=23 -Dharness.old.pig=`pwd` -Dharness.cluster.conf=/etc/hadoop/conf/ -Dharness.cluster.bin=/usr/lib/hadoop/bin/hadoop test-e2e -Dtests.to.run="-t RubyUDFs_1"
      

      The test fails with the following error:

      java.lang.IllegalStateException: Could not initialize interpreter (from file system or classpath) with /home/cheolsoo/pig-0.10/test/e2e/pig/testdist/libexec/ruby/scriptingudfs.rb
      

      The job jar generated by Pig does contain "scriptingudfs.rb", stored under its absolute path:

      [cheolsoo@c1405 pig-cheolsoo]$ jar tvf bad.jar | grep scriptingudfs.rb
        2491 Fri Jun 08 15:52:08 PDT 2012 /home/cheolsoo/pig-0.10/test/e2e/pig/testdist/libexec/ruby/scriptingudfs.rb
      

      According to the getScriptAsStream() method in ScriptEngine.java, "scriptingudfs.rb" is supposed to be read from the job jar when it does not exist on the local file system, but the lookup fails. The reason is that getResourceAsStream("/x") looks for the entry "x" (without the leading "/"), not "/x". Since "scriptingudfs.rb" is stored in the jar under its absolute path, leading "/" included, getResourceAsStream(scriptPath) cannot find it.

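      File file = new File(scriptPath);
      if (file.exists()) {
          // The script exists on the local file system: read it directly.
          try {
              is = new FileInputStream(file);
          } catch (FileNotFoundException e) {
              throw new IllegalStateException("could not find existing file "+scriptPath, e);
          }
      } else {
          // Otherwise fall back to the classpath, i.e. the job jar.
          if (file.isAbsolute()) {
              // For "/x" this looks up the jar entry "x", but the entry was stored as "/x".
              is = ScriptEngine.class.getResourceAsStream(scriptPath);
          } else {
              is = ScriptEngine.class.getResourceAsStream("/" + scriptPath);
          }
      }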
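
      The lookup behaviour described above can be checked in isolation. The following is a minimal sketch, not code from Pig: the class name JarLookupDemo, its helper methods, and the path /home/user/udfs/scriptingudfs.rb are made up for illustration. It builds a jar with a single entry and asks a URLClassLoader for the name that Class.getResourceAsStream("/home/user/...") would pass on to the class loader, namely the same path without the leading "/".

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

// Hypothetical demo class; not part of Pig.
public class JarLookupDemo {

    /** Writes a temporary jar that contains one entry with exactly the given name. */
    static File writeJar(String entryName) throws IOException {
        File jar = File.createTempFile("lookup-demo", ".jar");
        jar.deleteOnExit();
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream(jar))) {
            out.putNextEntry(new JarEntry(entryName));
            out.write("# ruby udf placeholder".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return jar;
    }

    /** Returns true if a class loader over the jar can open the given resource name. */
    static boolean canLoad(File jar, String resourceName) throws IOException {
        URL[] urls = { jar.toURI().toURL() };
        try (URLClassLoader loader = new URLClassLoader(urls, null);
             InputStream is = loader.getResourceAsStream(resourceName)) {
            return is != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Name the class loader is asked for once the leading "/" has been stripped.
        String lookedUp = "home/user/udfs/scriptingudfs.rb";

        // Entry stored with the leading "/", as in bad.jar.
        System.out.println("stored as /" + lookedUp + ": "
                + canLoad(writeJar("/" + lookedUp), lookedUp));
        // Entry stored without the leading "/", as in good.jar.
        System.out.println("stored as " + lookedUp + ": "
                + canLoad(writeJar(lookedUp), lookedUp));
    }
}
```

      The sketch only exercises the jar lookup; it does not model the file.exists() branch, which is why the problem stays hidden whenever the script is also present on the local file system.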
      

      In fact, the test passes in local mode or when run from an installed Pig, because "scriptingudfs.rb" is then found on the local file system (e.g. /usr/share/pig/test/e2e/pig/udfs/ruby/scriptingudfs.rb) and the jar lookup is never reached.

      The fix seems straightforward. The attached patch removes the leading "/" when registering UDF scripts, so that they are stored in the job jar without it:

      [cheolsoo@c1405 pig-cheolsoo]$ jar tvf good.jar | grep scriptingudfs.rb
        2491 Fri Jun 08 15:52:08 PDT 2012 home/cheolsoo/pig-0.10/test/e2e/pig/testdist/libexec/ruby/scriptingudfs.rb
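
      The key normalization described above might look roughly like the sketch below. It is an illustration only and does not reproduce the attached patch: the class ScriptJarKey and the method toJarKey are hypothetical names, and the check assumes UNIX-style paths whose root is "/".

```java
// Hypothetical helper; the names are illustrative and not taken from the Pig code base.
public class ScriptJarKey {

    /**
     * Returns the name under which a script would be stored in the job jar:
     * the path itself, minus a single leading "/" if there is one.
     */
    static String toJarKey(String scriptPath) {
        return scriptPath.startsWith("/") ? scriptPath.substring(1) : scriptPath;
    }

    public static void main(String[] args) {
        // Absolute path: the leading "/" is dropped.
        System.out.println(toJarKey("/home/user/udfs/scriptingudfs.rb"));
        // Relative path: returned unchanged.
        System.out.println(toJarKey("udfs/scriptingudfs.rb"));
    }
}
```

      With keys of this form, the name that getResourceAsStream("/" + key) hands to the class loader matches the entry name in the jar.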
      

      Thanks!

      1. Test001.java
        2 kB
        Daniel Dai
      2. PIG-2745-2.patch
        0.7 kB
        Cheolsoo Park
      3. PIG-2745.patch
        0.7 kB
        Cheolsoo Park
      4. enable_scripting_tests_23.patch
        5 kB
        Daniel Dai

          Activity

          Cheolsoo Park created issue -
          Cheolsoo Park made changes -
          Field Original Value New Value
          Attachment PIG-2745.patch [ 12531620 ]
          Cheolsoo Park made changes -
          Description To reproduce the issue, run the e2e test "RubyUDFs_1" in MR mode from the tarball (not from an installed Pig; the reason is explained below). Either pseudo-distributed or fully distributed Hadoop can be used.

          {code}
          ant -Dhadoopversion=23 -Dharness.old.pig=`pwd` -Dharness.cluster.conf=/etc/hadoop/conf/ -Dharness.cluster.bin=/usr/lib/hadoop/bin/hadoop test-e2e -Dtests.to.run="-t RubyUDFs_1"
          {code}

          The test fails with the following error:

          {code}
          java.lang.IllegalStateException: Could not initialize interpreter (from file system or classpath) with /home/cheolsoo/pig-0.10/test/e2e/pig/testdist/libexec/ruby/scriptingudfs.rb
          {code}

          Now look at the job jar generated by Pig and search for the "scriptingudfs.rb" that the error complains about.

          To keep the job jar in /tmp, I had to comment out the following line in JobControlCompiler.java:

          {code}
          submitJarFile.deleteOnExit();
          {code}

          The script is stored in the job jar under its absolute path:

          {code}
          [cheolsoo@c1405 pig-cheolsoo]$ jar tvf bad.jar | grep scriptingudfs.rb
            2491 Fri Jun 08 15:52:08 PDT 2012 /home/cheolsoo/pig-0.10/test/e2e/pig/testdist/libexec/ruby/scriptingudfs.rb
          {code}

          According to the getScriptAsStream() method in ScriptEngine.java, "scriptingudfs.rb" should be found in the jar, but it is not. The reason is that getResourceAsStream("/x") looks for "x" (without the leading "/"), not "/x", in the jar. Since "scriptingudfs.rb" is stored under its absolute path with the leading "/", getResourceAsStream(scriptPath) cannot find it.

          {code}
          File file = new File(scriptPath);
          if (file.exists()) {
              try {
                  is = new FileInputStream(file);
              } catch (FileNotFoundException e) {
                  throw new IllegalStateException("could not find existing file "+scriptPath, e);
              }
          } else {
              if (file.isAbsolute()) {
                  is = ScriptEngine.class.getResourceAsStream(scriptPath);
              } else {
                  is = ScriptEngine.class.getResourceAsStream("/" + scriptPath);
              }
          }
          {code}

          In fact, the test appears to pass in local mode or when run from an installed Pig, because "scriptingudfs.rb" then exists on the local file system (e.g. /usr/share/pig/test/e2e/pig/udfs/ruby/scriptingudfs.rb) and is read from there.

          On UNIX the fix seems straightforward: when registering UDF scripts, we can simply remove the leading "/". For example:

          {code:title=src/org/apache/pig/PigServer.java}
          - pigContext.addScriptFile(f.getPath());
          + String key = f.isAbsolute() ? f.getPath().substring(1) : f.getPath();
          + pigContext.addScriptFile(key, f.getPath());
          {code}

          As a result, the UDF scripts are stored in the job jar without the leading "/":

          {code}
          [cheolsoo@c1405 pig-cheolsoo]$ jar tvf good.jar | grep scriptingudfs.rb
            2491 Fri Jun 08 15:52:08 PDT 2012 home/cheolsoo/pig-0.10/test/e2e/pig/testdist/libexec/ruby/scriptingudfs.rb
          {code}

          But this won't work on Windows or S3, where the root directory is not "/".

          Alternatively, we could store the UDF scripts in the job jar under their file names instead of their full absolute paths. But that would prevent registering more than one UDF script with the same file name from different paths.

          I am wondering if anyone has a better suggestion. Thanks!
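
          The drawback of the file-name alternative mentioned above can be illustrated with a small sketch. The class FileNameKeyCollision, its methods, and the two example paths are hypothetical and are not taken from Pig; the sketch assumes "/"-separated paths.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of keying scripts by bare file name; not Pig code.
public class FileNameKeyCollision {

    /** Returns the last path component of a "/"-separated path. */
    static String fileNameKey(String scriptPath) {
        int slash = scriptPath.lastIndexOf('/');
        return slash < 0 ? scriptPath : scriptPath.substring(slash + 1);
    }

    /** Registers each script under its file name; later paths overwrite earlier ones. */
    static Map<String, String> register(String... scriptPaths) {
        Map<String, String> registered = new LinkedHashMap<>();
        for (String path : scriptPaths) {
            registered.put(fileNameKey(path), path);
        }
        return registered;
    }

    public static void main(String[] args) {
        // Two different scripts that happen to share a file name.
        Map<String, String> registered =
                register("/projects/a/udfs.rb", "/projects/b/udfs.rb");
        // Only one entry survives, because both scripts map to the same key.
        System.out.println(registered);
    }
}
```

          Keying by the full path, minus the leading "/", avoids this collision on UNIX but still leaves the Windows and S3 cases open.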
          Alan Gates made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Cheolsoo Park made changes -
          Attachment PIG-2745-2.patch [ 12532228 ]
          Daniel Dai made changes -
          Attachment Test001.java [ 12532358 ]
          Daniel Dai made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags Reviewed [ 10343 ]
          Assignee Cheolsoo Park [ cheolsoo ]
          Fix Version/s 0.11 [ 12318878 ]
          Fix Version/s 0.10.1 [ 12320547 ]
          Resolution Fixed [ 1 ]
          Daniel Dai made changes -
          Attachment enable_scripting_tests_23.patch [ 12532588 ]
          Daniel Dai made changes -
          Attachment enable_scripting_tests_23.patch [ 12532588 ]
          Daniel Dai made changes -
          Attachment enable_scripting_tests_23.patch [ 12532590 ]
          Cheolsoo Park made changes -
          Link This issue relates to PIG-2760 [ PIG-2760 ]
          Cheolsoo Park made changes -
          Link This issue relates to PIG-2623 [ PIG-2623 ]
          Rohini Palaniswamy made changes -
          Link This issue relates to PIG-2761 [ PIG-2761 ]
          Daniel Dai made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Cheolsoo Park
              Reporter:
              Cheolsoo Park
            • Votes:
              0
              Watchers:
              3