Pig
  1. Pig
  2. PIG-2576

Change in behavior for UDFContext.getUDFContext().getJobConf() in front-end

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.3
    • Fix Version/s: 0.10.0, 0.9.3, 0.11
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We read a file in the UDF constructor. (The file is transferred to the compute nodes via distache)
      To avoid this case in the front-end while the script is in the compile stage,
      we differentiate between front end and back end execution depending upon a condition ( UDFContext.getUDFContext().getJobConf() == null )

      This was working till Pig 0.9.1, in the current Pig 0.9 version this is breaking.
      ie, If I have any 'fs' commands after the STORE statement, the GruntParser invokes the udf constructor again and the above condition check returns false causing errors.

      1. PIG-2576_e2e.patch
        2 kB
        Thomas Weise
      2. PIG-2576_0.9.patch
        2 kB
        Thomas Weise
      3. PIG-2576_Script_UDF.txt
        3 kB
        Vivek Padmanabhan

        Issue Links

          Activity

          Hide
          Vivek Padmanabhan added a comment -

          Attaching a sample script, udf and exception trace to illustrate the issue.
          Looks like the change came in as part of PIG-2532 in which Pig MR Launcher will set the current UDFContext before the MR jobs are
          launched.
          Hence after the job launch , the UDFContext.getUDFContext().getJobConf() will
          always be a non null value.

          Show
          Vivek Padmanabhan added a comment - Attaching a sample script, udf and exception trace to illustrate the issue. Looks like the change came in as part of PIG-2532 in which Pig MR Launcher will set the current UDFContext before the MR jobs are launched. Hence after the job launch , the UDFContext.getUDFContext().getJobConf() will always be a non null value.
          Hide
          Daniel Dai added a comment -

          This might hardly count as defect. But if we absolutely need to bring the old behavior, we shall clone UDFContext instead of passing reference.

          Show
          Daniel Dai added a comment - This might hardly count as defect. But if we absolutely need to bring the old behavior, we shall clone UDFContext instead of passing reference.
          Hide
          Vivek Padmanabhan added a comment -

          Daniel, currently is there any api calls to differentiate the local and MR mode ?
          If not how about providing this information in UDFContext.

          Show
          Vivek Padmanabhan added a comment - Daniel, currently is there any api calls to differentiate the local and MR mode ? If not how about providing this information in UDFContext.
          Hide
          Thomas Weise added a comment -

          For the user it is a regression or incompatibility introduced in a minor/patch release. Whether the previous behavior of UDFContext.getUDFContext().getJobConf() was intended/documented and what workarounds there are (see below) is not really important, it was in place for long and users have come to rely on it. It is impossible for us to find out how many UDFs have been written with this dependency or change them.

          A better way for UDF developers to ensure initialization only runs in the backend is another issue. It should be possible to determine that easily and in a standard way. The UDF constructor is executed by the frontend (several times) and mapred, hence there is a need to determine where the code runs (for access to files etc.)

          UDF developer can workaround this by just being a bit more "safe" with the "where do I run" check:

          1) The initialization can be done lazily from exec(...), which is never called in the frontend?

          2) Instead of just checking for null, do this:

          
          Configuration jobConf = UDFContext.getUDFContext().getJobConf();
          if (jobConf != null && jobConf.get("mapred.task.id") != null) {
             System.err.println("Executing in BACKEND:" + jobConf.get("mapred.task.id"));    
          } else {
             System.err.println("Executing in FRONTEND" + jobConf);    
          }
          
          

          Both of the above work on 0.9.2 and previous releases.

          Show
          Thomas Weise added a comment - For the user it is a regression or incompatibility introduced in a minor/patch release. Whether the previous behavior of UDFContext.getUDFContext().getJobConf() was intended/documented and what workarounds there are (see below) is not really important, it was in place for long and users have come to rely on it. It is impossible for us to find out how many UDFs have been written with this dependency or change them. A better way for UDF developers to ensure initialization only runs in the backend is another issue. It should be possible to determine that easily and in a standard way. The UDF constructor is executed by the frontend (several times) and mapred, hence there is a need to determine where the code runs (for access to files etc.) UDF developer can workaround this by just being a bit more "safe" with the "where do I run" check: 1) The initialization can be done lazily from exec(...), which is never called in the frontend? 2) Instead of just checking for null, do this: Configuration jobConf = UDFContext.getUDFContext().getJobConf(); if (jobConf != null && jobConf.get( "mapred.task.id" ) != null ) { System .err.println( "Executing in BACKEND:" + jobConf.get( "mapred.task.id" )); } else { System .err.println( "Executing in FRONTEND" + jobConf); } Both of the above work on 0.9.2 and previous releases.
          Hide
          Ashutosh Chauhan added a comment -

          Until PIG-2344 is resolved this (wanting to know running in front/backend) will keep coming back, since state is exposed to user, where it should not have been. Long term fix of this is PIG-2344, in short term we can provide a static method which can return boolean telling its running in frontend or not, instead of user doing it themselves and getting into issues.

          Show
          Ashutosh Chauhan added a comment - Until PIG-2344 is resolved this (wanting to know running in front/backend) will keep coming back, since state is exposed to user, where it should not have been. Long term fix of this is PIG-2344 , in short term we can provide a static method which can return boolean telling its running in frontend or not, instead of user doing it themselves and getting into issues.
          Hide
          Thomas Weise added a comment -

          Patch to clone UDFContext. Also adding convenience method for frontend check.

          Patch tested on cluster and run ant test -Dtestcase=TestRegisteredJarVisibility

          Show
          Thomas Weise added a comment - Patch to clone UDFContext. Also adding convenience method for frontend check. Patch tested on cluster and run ant test -Dtestcase=TestRegisteredJarVisibility
          Hide
          Daniel Dai added a comment -

          +1

          Show
          Daniel Dai added a comment - +1
          Hide
          Thomas Weise added a comment -

          Attached e2e test to verify conf is always null in frontend.

          Show
          Thomas Weise added a comment - Attached e2e test to verify conf is always null in frontend.
          Hide
          Daniel Dai added a comment -

          +1. Patch committed to 0.9/0.10/trunk.

          Show
          Daniel Dai added a comment - +1. Patch committed to 0.9/0.10/trunk.

            People

            • Assignee:
              Thomas Weise
              Reporter:
              Vivek Padmanabhan
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development