PIG-2433

Jython import module not working if module path is in classpath

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0
    • Fix Version/s: 0.12.0
    • Component/s: impl
    • Labels:
      None

      Description

      This is a gap left by PIG-1824. If the path of a Python module is in the classpath, the job dies with the message: could not instantiate 'org.apache.pig.scripting.jython.JythonFunction'.

      Here is my observation:
      If the path of the Python module is in the classpath, the fileEntry we get at JythonScriptEngine:236 is __pyclasspath__/script$py.class instead of the script itself. As a result we cannot locate the script, and we skip it in job.xml.

      For example:

      register 'scriptB.py' using org.apache.pig.scripting.jython.JythonScriptEngine as pig;
      
      A = LOAD 'table_testPythonNestedImport' as (a0:long, a1:long);
      B = foreach A generate pig.square(a0);
      
      dump B;
      
      scriptB.py:
      
      #!/usr/bin/python
      import scriptA
      @outputSchema("x:{t:(num:double)}")
      def sqrt(number):
       return (number ** .5)
      @outputSchema("x:{t:(num:long)}")
      def square(number):
       return long(scriptA.square(number))
      
      scriptA.py:
      
      #!/usr/bin/python
      def square(number):
       return (number * number)
      

      When we register scriptB.py, we use the Jython library to figure out which modules scriptB depends on, in this case scriptA. However, if the current directory is in the classpath, we get __pyclasspath__/scriptA$py.class instead of scriptA.py. When we then try to put __pyclasspath__/scriptA$py.class into job.jar, Pig complains that it does not exist.

      This is exactly what TestScriptUDF.testPythonNestedImport does. In Hadoop 20.x the test still succeeds because MiniCluster picks up the local classpath, so it can find scriptA.py even though it is not in job.jar. However, the script fails on a real cluster and on MiniMRYarnCluster in Hadoop 23.
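      The mismatch described above can be sketched in plain Python. This is only an illustration under assumptions, not the code from the attached patch: the helper names source_name_for and locate_on_classpath are hypothetical, and the sketch assumes the compiled entry always ends in $py.class and sits under the __pyclasspath__/ prefix, as in the messages quoted in this issue.

```python
import os

# Prefix and suffix as they appear in the entries quoted in this issue.
PYCLASSPATH_PREFIX = "__pyclasspath__/"
COMPILED_SUFFIX = "$py.class"


def source_name_for(file_entry):
    """Map a module file entry to the relative .py path (hypothetical helper).

    Entries that do not come from the classpath are returned unchanged.
    """
    if not file_entry.startswith(PYCLASSPATH_PREFIX):
        return file_entry
    relative = file_entry[len(PYCLASSPATH_PREFIX):]
    if relative.endswith(COMPILED_SUFFIX):
        relative = relative[:-len(COMPILED_SUFFIX)] + ".py"
    return relative


def locate_on_classpath(file_entry, classpath_dirs):
    """Return the first existing source file for the entry, or None."""
    relative = source_name_for(file_entry)
    for directory in classpath_dirs:
        candidate = os.path.join(directory, relative)
        if os.path.isfile(candidate):
            return candidate
    return None
```

      With a resolver of this kind, an entry such as __pyclasspath__/scriptA$py.class would be looked up as scriptA.py in each classpath directory, so the real source file could be shipped instead of a path that does not exist on disk.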

      1. PIG-2433-1.patch
        13 kB
        Rohini Palaniswamy
      2. good.log
        202 kB
        Cheolsoo Park
      3. bad.log
        155 kB
        Cheolsoo Park
      4. TEST-org.apache.pig.test.TestScriptUDF.txt
        331 kB
        Cheolsoo Park
      5. PIG-2433.patch
        11 kB
        Rohini Palaniswamy

        Issue Links

          Activity

          Daniel Dai created issue -
          Daniel Dai made changes -
          Field Original Value New Value
          Link This issue relates to PIG-1824 [ PIG-1824 ]
          Daniel Dai made changes -
          Link This issue is related to PIG-2347 [ PIG-2347 ]
          Rohini Palaniswamy added a comment -

          Fixed the issue and added unit tests to import os and re.

          Note: If jython-standalone.jar is in the Pig classpath, I found that on a real cluster I had to add -Dmapred.child.env="JYTHONPATH=job.jar/Lib" to pick up the builtin modules, because the jar gets extracted on the datanode and Lib is not in the classpath. This might apply when running through Oozie too. I could not simulate the error in the unit test environment, even after removing the jython jar from mr-apps-classpath. If the extracted Lib directory, rather than the standalone jar, is in the classpath when launching Pig, the env setting is not required.

          Rohini Palaniswamy made changes -
          Attachment PIG-2433.patch [ 12550504 ]
          Rohini Palaniswamy made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Assignee Rohini Palaniswamy [ rohini ]
          Fix Version/s 0.12 [ 12323380 ]
          Rohini Palaniswamy added a comment -

          Seeing some errors with import unicodedata. Will update the patch after fixing that case too.

          Rohini Palaniswamy made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Rohini Palaniswamy added a comment -

          Changing status to Patch Available again. The issue turned out to be something that cannot be fixed in code but can be worked around.

          For anyone interested in the issue and the workaround:
          unicodedata.py loads the UnicodeData.txt and EastAsianWidth.txt files from inside its own code. Unlike imports, there is no way to detect these files and ship them with the jar. Also note that this happens when the Lib directory is in the classpath, not with the standalone jython jar file.

           
          loader = pkgutil.get_loader('unicodedata')
          init_unicodedata(StringIO.StringIO(loader.get_data(os.path.join(my_path,'UnicodeData.txt'))))
          init_east_asian_width(StringIO.StringIO(loader.get_data(os.path.join(my_path,'EastAsianWidth.txt'))))
          

          The workaround for that is to ship those two files with hadoop's tmpfiles or mapred.cache.files option and set -Dmapred.child.env="JYTHONPATH=."

          pig -Dmapred.child.env="JYTHONPATH=."
          -Dtmpfiles="file:///homes/rohinip/jython/UnicodeData.txt,file:///homes/rohinip/jython/EastAsianWidth.txt"
          norm_test.pig
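
          A small sketch of how the command line above could be assembled for an arbitrary list of data files. The function name build_pig_command is hypothetical; the two -D options are the ones from this comment, and the sketch assumes the data files are given as absolute local paths.

```python
def build_pig_command(pig_script, data_files):
    """Build the pig invocation for the workaround (hypothetical helper).

    data_files must be absolute local paths; they are passed to Hadoop
    through -Dtmpfiles as a comma-separated list of file:// URIs.
    """
    for path in data_files:
        if not path.startswith("/"):
            raise ValueError("expected an absolute path: %r" % (path,))
    tmpfiles = ",".join("file://" + path for path in data_files)
    return [
        "pig",
        "-Dmapred.child.env=JYTHONPATH=.",
        "-Dtmpfiles=" + tmpfiles,
        pig_script,
    ]


command = build_pig_command(
    "norm_test.pig",
    [
        "/homes/rohinip/jython/UnicodeData.txt",
        "/homes/rohinip/jython/EastAsianWidth.txt",
    ],
)
```

          The result is an argument list rather than a single string, so it can be handed to subprocess without shell quoting around JYTHONPATH=.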
          

          On a different note, found that progress is not reported in case of jython functions. Is this a known issue? Could not find any jiras.

          Rohini Palaniswamy made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Rohini Palaniswamy added a comment -

          bump. Review anyone?

          Cheolsoo Park added a comment -

          Hi Rohini,

          After applying the patch to trunk, I see the following error in TestScriptUDF.testPythonNestedImportClassPath:

          Testcase: testPythonNestedImportClassPath took 0.182 sec
              Caused an ERROR
          Python Error. Traceback (most recent call last):
            File "/home/cheolsoo/workspace/pig-svn/scriptB.py", line 2, in <module>
              import scriptA
            File "__pyclasspath__/scriptA.py", line 3, in <module>
          NameError: name 'outputSchema' is not defined
          

          Does this test pass for you?

          Rohini Palaniswamy added a comment -

          ant clean test -Dtestcase=TestScriptUDF passes for me.

          Rohini Palaniswamy added a comment -

          One cause for this error could be that your python cache dir is not writable and so the pig jar was not processed. Try running with -Dpython.cachedir=/<dir with write perms> if that is the case. Or are you running from eclipse?

          Cheolsoo Park added a comment -

          Hi Rohini,

          I tried what you suggested, but I still get the same error.

          ant clean test -Dtestcase=TestScriptUDF -Dpython.cachedir=/home/cheolsoo
          

          I see that the test fails on Mac, CentOS 6, and Ubuntu 12. It's not clear what's the root cause. I am attaching my test log.

          Rohini Palaniswamy added a comment -

          Based on the stack trace, I suspect the following code is failing for you. However, the attached log does not show any error, and the comment also says it will fail silently.

          // attempt addition of schema decorator handler, fail silently
          interpreter.exec("def outputSchema(schema_def):\n"
                  + "    def decorator(func):\n"
                  + "        func.outputSchema = schema_def\n"
                  + "        return func\n"
                  + "    return decorator\n\n");
          

          Test ran fine for me in Mac and RHEL 5. I will see if I can try and reproduce. Can you add org.python.core.Options.verbose = Py.DEBUG; in the static block of JythonScriptEngine and see if that gives any other additional error messages for you?
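
          For reference, the Python code embedded in that Java string can be exercised outside Pig. The sketch below is only an illustration of the failure mode in the traceback, not a diagnosis: it assumes scriptA.py uses the @outputSchema decorator (as the traceback suggests), and SCRIPT_A_SOURCE and defines_cleanly are hypothetical names.

```python
def outputSchema(schema_def):
    """Same decorator as the one the interpreter is asked to define."""
    def decorator(func):
        func.outputSchema = schema_def
        return func
    return decorator


# Stand-in for a module that relies on the decorator being predefined.
SCRIPT_A_SOURCE = (
    '@outputSchema("x:{t:(num:long)}")\n'
    'def square(number):\n'
    '    return number * number\n'
)


def defines_cleanly(source, namespace):
    """Run the source in the given namespace; False if a name is missing."""
    try:
        exec(source, namespace)
        return True
    except NameError:
        return False


# Namespace where the decorator was injected first: the module loads.
prepared = {"outputSchema": outputSchema}
ok = defines_cleanly(SCRIPT_A_SOURCE, prepared)

# Namespace without the decorator: the decorator line raises NameError.
bare_ok = defines_cleanly(SCRIPT_A_SOURCE, {})
```

          If the module is executed in a namespace where the decorator was never defined, the NameError appears at the decorator line, which matches the shape of the error reported above.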

          Cheolsoo Park added a comment -

          Hi Rohini,

          I found that the order in which the test cases run matters. I am attaching two log files: good.log and bad.log. If I use OrderedJUnit4Runner to force testPythonNestedImportClassPath to run before testPythonBuiltinModuleImport1, they all pass. But if testPythonBuiltinModuleImport1 runs before testPythonNestedImportClassPath, testPythonNestedImportClassPath fails:

          good.log
          Testcase: testPythonNestedImportClassPath took 38.565 sec
          Testcase: testPythonBuiltinModuleImport1 took 35.904 sec
          
          bad.log
          Testcase: testPythonBuiltinModuleImport1 took 38.756 sec
          Testcase: testPythonNestedImportClassPath took 0.124 sec
              Caused an ERROR
          Python Error. Traceback (most recent call last):
            File "/Users/cheolsoo/workspace/pig/scriptB.py", line 2, in <module>
              import scriptA
            File "__pyclasspath__/scriptA.py", line 3, in <module>
          NameError: name 'outputSchema' is not defined
          
          Cheolsoo Park made changes -
          Attachment bad.log [ 12563609 ]
          Attachment good.log [ 12563610 ]
          Cheolsoo Park added a comment -

          I also turned on DEBUG as per your request, so you can see extra debug messages in the log files.

          Rohini Palaniswamy added a comment -

          Thanks Cheolsoo. I think this has something to do with PythonInterpreter being static in JythonScriptEngine. You must be running with JDK 7, so the test order was different; I was running with JDK 6, which is why I did not see it. I will investigate and fix it.

          Rohini Palaniswamy added a comment -

          Used different names for different modules. Tests pass when run with jdk7 now.
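
          One way to picture why distinct module names help, assuming the problem comes from a shared interpreter caching modules by name (the thread only suspects this; it is not confirmed here). The sketch uses standard CPython behaviour, where an import is answered from sys.modules if the name is already present; the module names and helpers are hypothetical.

```python
import importlib
import os
import sys
import tempfile


def write_module(directory, name, body):
    """Create <directory>/<name>.py with the given source text."""
    path = os.path.join(directory, name + ".py")
    with open(path, "w") as handle:
        handle.write(body)
    return path


def import_from(directory, name):
    """Import a module by name with the directory temporarily on sys.path."""
    sys.path.insert(0, directory)
    try:
        return importlib.import_module(name)
    finally:
        sys.path.remove(directory)


# Two "tests" that each write their own copy of a helper module.
first_dir = tempfile.mkdtemp()
second_dir = tempfile.mkdtemp()
write_module(first_dir, "shared_name_demo", "VALUE = 'first'\n")
write_module(second_dir, "shared_name_demo", "VALUE = 'second'\n")
write_module(second_dir, "unique_name_demo", "VALUE = 'second'\n")

first = import_from(first_dir, "shared_name_demo")
# Same name again: the cached module from the first import is returned.
stale = import_from(second_dir, "shared_name_demo")
# A name not used before is loaded from the second directory as expected.
fresh = import_from(second_dir, "unique_name_demo")
```

          Here the second import of the shared name never reads the second file, while the uniquely named module is loaded normally, so test order stops mattering once every test uses its own module names.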

          Rohini Palaniswamy made changes -
          Attachment PIG-2433-1.patch [ 12563658 ]
          Cheolsoo Park added a comment -

          +1.

          Thanks for the fix. The test passes for me too. I also ran e2e test and found no failure.

          Minor comment:
          When you commit the patch, can you remove a tab char in the following line?

          +    	<!-- Remove jython jar from mrapp-generated-classpath -->
          
          Rohini Palaniswamy added a comment -

          Thanks for the review Cheolsoo. Removed the tab before committing. Committed to trunk.

          Rohini Palaniswamy made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Daniel Dai made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Rohini Palaniswamy
              Reporter:
              Daniel Dai