Uploaded image for project: 'Oozie'
  1. Oozie
  2. OOZIE-994

ActionCheckXCommand does not handle failures properly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.3.0
    • Component/s: workflow
    • Labels:
      None

      Description

      If the JT restarts or dies and running jobs are lost or the JT is not reachable, Oozie ActionCheckXCommand will never fail the workflow job.

      There seem to be 2 issues here:

      • convertException is not receiving the root cause exception anytmore, but alway HadoopAccessorException wrapping the root cause exception. We should modify the convertException to inspect the cause exception as well.
      • ActionCheckXCommand does not do the handle retry logic of ActionStartXCommand.
      1. OOZIE-994.patch
        28 kB
        Robert Kanter
      2. OOZIE-994.patch
        28 kB
        Robert Kanter
      3. OOZIE-994.patch
        27 kB
        Robert Kanter
      4. OOZIE-994.patch
        26 kB
        Robert Kanter
      5. OOZIE-994.patch
        22 kB
        Robert Kanter
      6. OOZIE-994.patch
        22 kB
        Robert Kanter
      7. OOZIE-994.patch
        22 kB
        Robert Kanter

        Issue Links

          Activity

          Hide
          hadoopqa Hadoop QA added a comment -

          Testing JIRA OOZIE-994

          Cleaning local svn workspace

          ----------------------------
          
          +1 PATCH_APPLIES
             CLEAN cleaned target directories
          -1 RAW_PATCH_ANALYSIS
              +1 the patch does not introduce any @author tags
              +1 the patch does not introduce any tabs
              +1 the patch does not introduce any trailing spaces
              -1 the patch contains 1 line(s) longer than 132 characters
              +1 the patch does adds/modifies 3 testcase(s)
          +1 RAT
              +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
              +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
              +1 HEAD compiles
              +1 patch compiles
              +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
              +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
              +1 the patch does not modify JPA files
          +1 TESTS
             Tests run: 912
             Tests failures: 0
             Tests errors: 0
          +1 DISTRO
              +1 distro tarball builds with the patch 
          
          ----------------------------
          

          The full output of the test-patch run is available at

          https://builds.apache.org/job/oozie-trunk-precommit-build/105/

          Show
          hadoopqa Hadoop QA added a comment - Testing JIRA OOZIE-994 Cleaning local svn workspace ---------------------------- +1 PATCH_APPLIES CLEAN cleaned target directories -1 RAW_PATCH_ANALYSIS +1 the patch does not introduce any @author tags +1 the patch does not introduce any tabs +1 the patch does not introduce any trailing spaces -1 the patch contains 1 line(s) longer than 132 characters +1 the patch does adds/modifies 3 testcase(s) +1 RAT +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE +1 HEAD compiles +1 patch compiles +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations +1 the patch does not modify JPA files +1 TESTS Tests run: 912 Tests failures: 0 Tests errors: 0 +1 DISTRO +1 distro tarball builds with the patch ---------------------------- The full output of the test-patch run is available at https://builds.apache.org/job/oozie-trunk-precommit-build/105/
          Hide
          virag Virag Kothari added a comment -

          If the error is TRANSIENT, shoudn't the status of action be START_MANUAL/START_RETRY instead of RUNNING in ActionCheck?

          Show
          virag Virag Kothari added a comment - If the error is TRANSIENT, shoudn't the status of action be START_MANUAL/START_RETRY instead of RUNNING in ActionCheck?
          Hide
          rkanter Robert Kanter added a comment -

          I believe the START_MANUAL/START_RETRY statuses are for when the action is being started, right? (i.e. ActionStartXCommand). At this point, the action has already started running before it gets the error and would behave more like when you suspend a workflow (in which case, the action's status is still RUNNING).

          Show
          rkanter Robert Kanter added a comment - I believe the START_MANUAL/START_RETRY statuses are for when the action is being started, right? (i.e. ActionStartXCommand). At this point, the action has already started running before it gets the error and would behave more like when you suspend a workflow (in which case, the action's status is still RUNNING).
          Hide
          virag Virag Kothari added a comment -

          The ActionStartX also checks the status of the hadoop job (check() method) and in some cases may fail while the action is RUNNING.

          If the action's status is kept as RUNNING and job as SUSPENDED, then I believe we wont have a way to distinguish between user issuing a 'suspend' cmd and the workflow suspended due to transient error.

          Show
          virag Virag Kothari added a comment - The ActionStartX also checks the status of the hadoop job (check() method) and in some cases may fail while the action is RUNNING. If the action's status is kept as RUNNING and job as SUSPENDED, then I believe we wont have a way to distinguish between user issuing a 'suspend' cmd and the workflow suspended due to transient error.
          Hide
          rkanter Robert Kanter added a comment -

          I guess there isn't an "easy" way to distinguish them. There is a "transient" counter and the number of retries that both get incremented when this happens that shouldn't be incremented when the 'suspend' command is used, but I'm not sure if these are exposed to the user on the web UI or CLI.

          Show
          rkanter Robert Kanter added a comment - I guess there isn't an "easy" way to distinguish them. There is a "transient" counter and the number of retries that both get incremented when this happens that shouldn't be incremented when the 'suspend' command is used, but I'm not sure if these are exposed to the user on the web UI or CLI.
          Hide
          virag Virag Kothari added a comment -

          Yes, I think those stats are internal and should not be exposed to user.

          Even though a transient error occurred while the action was RUNNING, the action still has to be started again and has to go through all lifecycle steps (PREP -> RUNNING ->etc.). So why not call the status as START_RETRY/START_MANUAL? (Retry manually (ResumeX) or through Recovery service). In this way, the user will be able to know that a transient error occurred.

          Show
          virag Virag Kothari added a comment - Yes, I think those stats are internal and should not be exposed to user. Even though a transient error occurred while the action was RUNNING, the action still has to be started again and has to go through all lifecycle steps (PREP -> RUNNING ->etc.). So why not call the status as START_RETRY/START_MANUAL? (Retry manually (ResumeX) or through Recovery service). In this way, the user will be able to know that a transient error occurred.
          Hide
          virag Virag Kothari added a comment -

          Had a discussion with Mohammad. Will summarize it

          When the retry is being done, the status can be kept RUNNING (The good thing is ActionCheckerService will re-queue RUNNING actions if something bad happens)
          Once its suspended and retry is done, the status can be set to 'START_MANUAL' (As the action has to be started through ActionStartX in ResumeX)

          Basically, referring to the following code change.

           if (!handleTransient(context, executor, wfAction.getStatus)) // wfAction.getStatus() should be always RUNNING I believe?{
          handleNonTransient(context, executor, START_MANUAL); // Having START_MANUAL will make the user know he has to resume the action manually
          
          Show
          virag Virag Kothari added a comment - Had a discussion with Mohammad. Will summarize it When the retry is being done, the status can be kept RUNNING (The good thing is ActionCheckerService will re-queue RUNNING actions if something bad happens) Once its suspended and retry is done, the status can be set to 'START_MANUAL' (As the action has to be started through ActionStartX in ResumeX) Basically, referring to the following code change. if (!handleTransient(context, executor, wfAction.getStatus)) // wfAction.getStatus() should be always RUNNING I believe?{ handleNonTransient(context, executor, START_MANUAL); // Having START_MANUAL will make the user know he has to resume the action manually
          Hide
          rkanter Robert Kanter added a comment -

          Those are good points Virag, I've uploaded a new patch accordingly.

          Show
          rkanter Robert Kanter added a comment - Those are good points Virag, I've uploaded a new patch accordingly.
          Hide
          hadoopqa Hadoop QA added a comment -

          Testing JIRA OOZIE-994

          Cleaning local svn workspace

          ----------------------------
          
          +1 PATCH_APPLIES
             CLEAN cleaned target directories
          +1 RAW_PATCH_ANALYSIS
              +1 the patch does not introduce any @author tags
              +1 the patch does not introduce any tabs
              +1 the patch does not introduce any trailing spaces
              +1 the patch does not introduce any line longer than 132
              +1 the patch does adds/modifies 3 testcase(s)
          +1 RAT
              +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
              +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
              +1 HEAD compiles
              +1 patch compiles
              +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
              +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
              +1 the patch does not modify JPA files
          +1 TESTS
             Tests run: 912
             Tests failures: 40
             Tests errors: 5
          +1 DISTRO
              +1 distro tarball builds with the patch 
          
          ----------------------------
          

          The full output of the test-patch run is available at

          https://builds.apache.org/job/oozie-trunk-precommit-build/107/

          Show
          hadoopqa Hadoop QA added a comment - Testing JIRA OOZIE-994 Cleaning local svn workspace ---------------------------- +1 PATCH_APPLIES CLEAN cleaned target directories +1 RAW_PATCH_ANALYSIS +1 the patch does not introduce any @author tags +1 the patch does not introduce any tabs +1 the patch does not introduce any trailing spaces +1 the patch does not introduce any line longer than 132 +1 the patch does adds/modifies 3 testcase(s) +1 RAT +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE +1 HEAD compiles +1 patch compiles +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations +1 the patch does not modify JPA files +1 TESTS Tests run: 912 Tests failures: 40 Tests errors: 5 +1 DISTRO +1 distro tarball builds with the patch ---------------------------- The full output of the test-patch run is available at https://builds.apache.org/job/oozie-trunk-precommit-build/107/
          Hide
          rkanter Robert Kanter added a comment -

          Re-attaching same patch so it will run again to double check; I don't think those tests should have failed like that...

          Show
          rkanter Robert Kanter added a comment - Re-attaching same patch so it will run again to double check; I don't think those tests should have failed like that...
          Hide
          hadoopqa Hadoop QA added a comment -

          Testing JIRA OOZIE-994

          Cleaning local svn workspace

          ----------------------------
          
          +1 PATCH_APPLIES
             CLEAN cleaned target directories
          +1 RAW_PATCH_ANALYSIS
              +1 the patch does not introduce any @author tags
              +1 the patch does not introduce any tabs
              +1 the patch does not introduce any trailing spaces
              +1 the patch does not introduce any line longer than 132
              +1 the patch does adds/modifies 3 testcase(s)
          +1 RAT
              +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
              +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
              +1 HEAD compiles
              +1 patch compiles
              +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
              +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
              +1 the patch does not modify JPA files
          +1 TESTS
             Tests run: 912
             Tests failures: 0
             Tests errors: 0
          +1 DISTRO
              +1 distro tarball builds with the patch 
          
          ----------------------------
          

          The full output of the test-patch run is available at

          https://builds.apache.org/job/oozie-trunk-precommit-build/108/

          Show
          hadoopqa Hadoop QA added a comment - Testing JIRA OOZIE-994 Cleaning local svn workspace ---------------------------- +1 PATCH_APPLIES CLEAN cleaned target directories +1 RAW_PATCH_ANALYSIS +1 the patch does not introduce any @author tags +1 the patch does not introduce any tabs +1 the patch does not introduce any trailing spaces +1 the patch does not introduce any line longer than 132 +1 the patch does adds/modifies 3 testcase(s) +1 RAT +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE +1 HEAD compiles +1 patch compiles +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations +1 the patch does not modify JPA files +1 TESTS Tests run: 912 Tests failures: 0 Tests errors: 0 +1 DISTRO +1 distro tarball builds with the patch ---------------------------- The full output of the test-patch run is available at https://builds.apache.org/job/oozie-trunk-precommit-build/108/
          Hide
          virag Virag Kothari added a comment -

          Some questions:

          Is incRetries() required after the action has auto retried max amount of times? Shouldn't the retries be set to O (same like ActionStartX)?

          Can the two for loops in ActionExecutor combined in one?

          Show
          virag Virag Kothari added a comment - Some questions: Is incRetries() required after the action has auto retried max amount of times? Shouldn't the retries be set to O (same like ActionStartX)? Can the two for loops in ActionExecutor combined in one?
          Hide
          rkanter Robert Kanter added a comment -

          It increments the retries count so that in ResumeXCommand, we have a way of differentiating this case from resuming from other reasons. We then set it to 0 in there. In other words, I'm kinda using it like a flag; if there's a better way to detect this case in ResumeXCommand I'd have no problem with changing it.

          The reason I used two for loops was because of the case where the outer and wrapped exceptions are both registered. If we check both in one for loop, then the order that we iterate through the registered exceptions would determine which one we picked. For example, if type1 was caused by type2 and both were registered, we want type2 to be chosen but it could choose type1 depending on their ordering in ERROR_INFOS.

          Show
          rkanter Robert Kanter added a comment - It increments the retries count so that in ResumeXCommand, we have a way of differentiating this case from resuming from other reasons. We then set it to 0 in there. In other words, I'm kinda using it like a flag; if there's a better way to detect this case in ResumeXCommand I'd have no problem with changing it. The reason I used two for loops was because of the case where the outer and wrapped exceptions are both registered. If we check both in one for loop, then the order that we iterate through the registered exceptions would determine which one we picked. For example, if type1 was caused by type2 and both were registered, we want type2 to be chosen but it could choose type1 depending on their ordering in ERROR_INFOS .
          Hide
          virag Virag Kothari added a comment - - edited

          I think its okay if ResumeXCommand doesn't know a way to distinguish between workflow suspended by ActionCheckX or some other reason. If is sees the action status as START_MANUAL, it can queue the ActionStartX. (clean up of actionDir can be done before starting)

          If you want all of the checks to be done against the wrapped exceptions first, can we have

          for (){
             if( match (Exception.getcause()){  
                return new AEException("..")   // Return immediately.
              }
              if (match (Exception)){
                Exception e = new AEException ("..") //dont return immediately 
              }
          }
          
          if (e!=null){
          return e;
          }
          
          Show
          virag Virag Kothari added a comment - - edited I think its okay if ResumeXCommand doesn't know a way to distinguish between workflow suspended by ActionCheckX or some other reason. If is sees the action status as START_MANUAL, it can queue the ActionStartX. (clean up of actionDir can be done before starting) If you want all of the checks to be done against the wrapped exceptions first, can we have for (){ if ( match (Exception.getcause()){ return new AEException( ".." ) // Return immediately. } if (match (Exception)){ Exception e = new AEException ( ".." ) //dont return immediately } } if (e!= null ){ return e; }
          Hide
          rkanter Robert Kanter added a comment -

          That's a good idea for the exception loop. Doing that exposed a new issue though. Suppose you register IOException and RemoteException (like in TestActionExecutor). In the for loop, we check errorInfo.getKey().isInstance(ex) but because RemoteException is a subclass of IOException, this will pass when ex is a RemoteException and errrorInfo.getKey() is an IOException, which is not what we want. Do you know of a good way in Java to get the "strictest" subclass of a class? I have an idea that might work that I'll try, but its probably not the cleanest solution.

          Show
          rkanter Robert Kanter added a comment - That's a good idea for the exception loop. Doing that exposed a new issue though. Suppose you register IOException and RemoteException (like in TestActionExecutor). In the for loop, we check errorInfo.getKey().isInstance(ex) but because RemoteException is a subclass of IOException, this will pass when ex is a RemoteException and errrorInfo.getKey() is an IOException , which is not what we want. Do you know of a good way in Java to get the "strictest" subclass of a class? I have an idea that might work that I'll try, but its probably not the cleanest solution.
          Hide
          virag Virag Kothari added a comment -

          As a LinkedHashMap is used to store the registeredExceptions and as an entrySet() is being used to iterate over the exceptions, the ordering of the exceptions will be preserved. Also, I see that exceptions like IOException are registered in the end. (In JavaActionExecutor).

          So we can avoid this subclass match problem by checking whether the exception is set or not. If its set, it should not be set again.
          So just making this change in the code from above would work:

          Exception e = null;
          
          if (match (Exception)){
            if (e==null){
                e = new AEException ("..") 
            }
          }
          }
          
          Show
          virag Virag Kothari added a comment - As a LinkedHashMap is used to store the registeredExceptions and as an entrySet() is being used to iterate over the exceptions, the ordering of the exceptions will be preserved. Also, I see that exceptions like IOException are registered in the end. (In JavaActionExecutor). So we can avoid this subclass match problem by checking whether the exception is set or not. If its set, it should not be set again. So just making this change in the code from above would work: Exception e = null ; if (match (Exception)){ if (e== null ){ e = new AEException ( ".." ) } } }
          Hide
          rkanter Robert Kanter added a comment -

          That relies on the exceptions being registered in a specific order though, which works fine now, but in the long run could cause confusion if someone registers new exceptions or adds a new Executor. I came up with a different way that seems to be working; once I'm done fixing the retries thing, I'll post it in the new patch.

          Show
          rkanter Robert Kanter added a comment - That relies on the exceptions being registered in a specific order though, which works fine now, but in the long run could cause confusion if someone registers new exceptions or adds a new Executor. I came up with a different way that seems to be working; once I'm done fixing the retries thing, I'll post it in the new patch.
          Hide
          rkanter Robert Kanter added a comment -

          updated patch

          Show
          rkanter Robert Kanter added a comment - updated patch
          Hide
          hadoopqa Hadoop QA added a comment -

          Testing JIRA OOZIE-994

          Cleaning local svn workspace

          ----------------------------
          
          +1 PATCH_APPLIES
             CLEAN cleaned target directories
          +1 RAW_PATCH_ANALYSIS
              +1 the patch does not introduce any @author tags
              +1 the patch does not introduce any tabs
              +1 the patch does not introduce any trailing spaces
              +1 the patch does not introduce any line longer than 132
              +1 the patch does adds/modifies 3 testcase(s)
          +1 RAT
              +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
              +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
              +1 HEAD compiles
              +1 patch compiles
              +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
              +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
              +1 the patch does not modify JPA files
          +1 TESTS
             Tests run: 912
             Tests failures: 0
             Tests errors: 1
          +1 DISTRO
              +1 distro tarball builds with the patch 
          
          ----------------------------
          

          The full output of the test-patch run is available at

          https://builds.apache.org/job/oozie-trunk-precommit-build/109/

          Show
          hadoopqa Hadoop QA added a comment - Testing JIRA OOZIE-994 Cleaning local svn workspace ---------------------------- +1 PATCH_APPLIES CLEAN cleaned target directories +1 RAW_PATCH_ANALYSIS +1 the patch does not introduce any @author tags +1 the patch does not introduce any tabs +1 the patch does not introduce any trailing spaces +1 the patch does not introduce any line longer than 132 +1 the patch does adds/modifies 3 testcase(s) +1 RAT +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE +1 HEAD compiles +1 patch compiles +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations +1 the patch does not modify JPA files +1 TESTS Tests run: 912 Tests failures: 0 Tests errors: 1 +1 DISTRO +1 distro tarball builds with the patch ---------------------------- The full output of the test-patch run is available at https://builds.apache.org/job/oozie-trunk-precommit-build/109/
          Hide
          rkanter Robert Kanter added a comment -

          The test failure is caused by:

          Caused by: java.lang.IllegalArgumentException: Pathname /user/test/oozie-jenk/0000000-120925224647405-oozie-jenk-W/:start:--:START: from hdfs://localhost:44005/user/test/oozie-jenk/0000000-120925224647405-oozie-jenk-W/:start:--:START: is not a valid DFS filename.
          	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:132)
          	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:513)
          	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:768)
          	at org.apache.oozie.command.wf.ResumeXCommand.execute(ResumeXCommand.java:89)
          ....
          

          Line 89 in ResumeXCommand.java is:

          ActionExecutorContext context = new ActionXCommand.ActionExecutorContext(workflow, action, false,false); // line 88
          if (context.getAppFileSystem().exists(context.getActionDir())) { // line 89
              context.getAppFileSystem().delete(context.getActionDir(), true); // line 90
          }
          

          I'm not sure why this didn't happen locally. It looks like the problem is that this was called on a start action, which is :start: and I imagine that the colon is not allowed in a filename; I'll have to make it check for these types of actions (e.g. start, end, etc).

          Show
          rkanter Robert Kanter added a comment - The test failure is caused by: Caused by: java.lang.IllegalArgumentException: Pathname /user/test/oozie-jenk/0000000-120925224647405-oozie-jenk-W/:start:--:START: from hdfs: //localhost:44005/user/test/oozie-jenk/0000000-120925224647405-oozie-jenk-W/:start:--:START: is not a valid DFS filename. at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:132) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:513) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:768) at org.apache.oozie.command.wf.ResumeXCommand.execute(ResumeXCommand.java:89) .... Line 89 in ResumeXCommand.java is: ActionExecutorContext context = new ActionXCommand.ActionExecutorContext(workflow, action, false , false ); // line 88 if (context.getAppFileSystem().exists(context.getActionDir())) { // line 89 context.getAppFileSystem().delete(context.getActionDir(), true ); // line 90 } I'm not sure why this didn't happen locally. It looks like the problem is that this was called on a start action, which is :start: and I imagine that the colon is not allowed in a filename; I'll have to make it check for these types of actions (e.g. start, end, etc).
          Hide
          rkanter Robert Kanter added a comment -

          updated patch to skip control nodes (e.g. start, end, etc) when cleaning up the action dir.

          Show
          rkanter Robert Kanter added a comment - updated patch to skip control nodes (e.g. start, end, etc) when cleaning up the action dir.
          Hide
          hadoopqa Hadoop QA added a comment -

          Testing JIRA OOZIE-994

          Cleaning local svn workspace

          ----------------------------
          
          +1 PATCH_APPLIES
             CLEAN cleaned target directories
          +1 RAW_PATCH_ANALYSIS
              +1 the patch does not introduce any @author tags
              +1 the patch does not introduce any tabs
              +1 the patch does not introduce any trailing spaces
              +1 the patch does not introduce any line longer than 132
              +1 the patch does adds/modifies 3 testcase(s)
          +1 RAT
              +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
              +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
              +1 HEAD compiles
              +1 patch compiles
              +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
              +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
              +1 the patch does not modify JPA files
          +1 TESTS
             Tests run: 912
             Tests failures: 0
             Tests errors: 0
          +1 DISTRO
              +1 distro tarball builds with the patch 
          
          ----------------------------
          

          The full output of the test-patch run is available at

          https://builds.apache.org/job/oozie-trunk-precommit-build/110/

          Show
          hadoopqa Hadoop QA added a comment - Testing JIRA OOZIE-994 Cleaning local svn workspace ---------------------------- +1 PATCH_APPLIES CLEAN cleaned target directories +1 RAW_PATCH_ANALYSIS +1 the patch does not introduce any @author tags +1 the patch does not introduce any tabs +1 the patch does not introduce any trailing spaces +1 the patch does not introduce any line longer than 132 +1 the patch does adds/modifies 3 testcase(s) +1 RAT +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE +1 HEAD compiles +1 patch compiles +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations +1 the patch does not modify JPA files +1 TESTS Tests run: 912 Tests failures: 0 Tests errors: 0 +1 DISTRO +1 distro tarball builds with the patch ---------------------------- The full output of the test-patch run is available at https://builds.apache.org/job/oozie-trunk-precommit-build/110/
          Hide
          virag Virag Kothari added a comment -

          The changes related to retries look good.

          Some questions related to changes in registration of Exceptions.

          -  ERROR_INFOS.put(getType(), new LinkedHashMap<Class, ErrorInfo>());
          +  ERROR_INFOS.put(getType(), new LinkedHashMap<String, Set<ErrorInfo>>());
          

          registerError() will register a different Exception each time. I dont understand why this change from a single ErrorInfo to a set of ErrorInfo is required.
          Also, the case when a super class is registered but the subclass is not will not be handled.
          E.g: Register only IOException. In the test case, throw RemoteException. Previously the convertException() would catch the RemoteException as a subclass of IO. But now it wont as all the error's are not being iterated over.

          Show
          virag Virag Kothari added a comment - The changes related to retries look good. Some questions related to changes in registration of Exceptions. - ERROR_INFOS.put(getType(), new LinkedHashMap< Class , ErrorInfo>()); + ERROR_INFOS.put(getType(), new LinkedHashMap< String , Set<ErrorInfo>>()); registerError() will register a different Exception each time. I dont understand why this change from a single ErrorInfo to a set of ErrorInfo is required. Also, the case when a super class is registered but the subclass is not will not be handled. E.g: Register only IOException. In the test case, throw RemoteException. Previously the convertException() would catch the RemoteException as a subclass of IO. But now it wont as all the error's are not being iterated over.
          Hide
          rkanter Robert Kanter added a comment -

          The reason I used a set of ErrorInfos was from the javadoc of registerError(): @param exClass excpetion class name (to work in case of a particular exception not being in the classpath, needed to be able to handle multiple version of Hadoop or other JARs used by executors with the same codebase). I'm not entirely sure, but it seems to imply that you could have multiple versions of the same exception class (e.g. multiple implementations of the same exception class). Because I changed the key of the ERROR_INFOS to be the name of the exceptions (i.e. org.blah.blah.WhateverException), two implementations of the same exception class would have the same name. Is that not what the javadoc is saying?

          I didn't think about subclasses not being handled when I made my changes. Thinking about this now it seems tricky because:
          1) If only the superclass is registered, we should catch subclasses as that superclass
          2) If both the superclass and subclass are registered, we should catch them separately
          I think this was working before because, as you pointed out earlier, the exceptions are registered in the "correct" order when it was just checking isInstance(); I suppose we can keep this approach, but it doesn't seem ideal.

          Show
          rkanter Robert Kanter added a comment - The reason I used a set of ErrorInfos was from the javadoc of registerError(): @param exClass excpetion class name (to work in case of a particular exception not being in the classpath, needed to be able to handle multiple version of Hadoop or other JARs used by executors with the same codebase). I'm not entirely sure, but it seems to imply that you could have multiple versions of the same exception class (e.g. multiple implementations of the same exception class). Because I changed the key of the ERROR_INFOS to be the name of the exceptions (i.e. org.blah.blah.WhateverException), two implementations of the same exception class would have the same name. Is that not what the javadoc is saying? I didn't think about subclasses not being handled when I made my changes. Thinking about this now it seems tricky because: 1) If only the superclass is registered, we should catch subclasses as that superclass 2) If both the superclass and subclass are registered, we should catch them separately I think this was working before because, as you pointed out earlier, the exceptions are registered in the "correct" order when it was just checking isInstance() ; I suppose we can keep this approach, but it doesn't seem ideal.
          Hide
          rkanter Robert Kanter added a comment -

          Actually, for the subclass issue, where we check if the name is in the ERROR_INFOS, I can just add an else that checks the subclass (though if you have an exception, its parent, and its parent's parent, you could still run into the issue where it depends on the order they were registered, but worrying about this is probably being paranoid).

          Show
          rkanter Robert Kanter added a comment - Actually, for the subclass issue, where we check if the name is in the ERROR_INFOS, I can just add an else that checks the subclass (though if you have an exception, its parent, and its parent's parent, you could still run into the issue where it depends on the order they were registered, but worrying about this is probably being paranoid).
          Hide
          virag Virag Kothari added a comment -

          I think when the javadoc talks about multiple versions of Hadoop, it means the registered exception class may not be in the classpath for some version of hadoop
          For. eg. in JavaActionExecutor.registorError, there is

           registerError(org.apache.hadoop.hdfs.protocol.QuotaExceededException.class.getName(),
                              ActionExecutorException.ErrorType.NON_TRANSIENT, "JA004");
          

          This class may not exist in some version of hadoop and it needs to be handled. That is what the javadoc is pointing to.

          I believe the mapping of the className to the errorInfo is unique. As, it doesn't make sense to have something like

           registerError(ABC.class.getName(),
                              ActionExecutorException.ErrorType.NON_TRANSIENT, "JA004");
          registerError(ABC.class.getName(),
                              ActionExecutorException.ErrorType.TRANSIENT, "JA007");
          
          

          Even if there are multiple implementations of same class, the class name would be different. So, no need of using Set of error infos.

          I would prefer that we go back to the same earlier approach that was working before even though it requires exception to be registered in correct order.

          Show
          virag Virag Kothari added a comment - I think when the javadoc talks about multiple versions of Hadoop, it means the registered exception class may not be in the classpath for some version of hadoop For. eg. in JavaActionExecutor.registorError, there is registerError(org.apache.hadoop.hdfs.protocol.QuotaExceededException.class.getName(), ActionExecutorException.ErrorType.NON_TRANSIENT, "JA004" ); This class may not exist in some version of hadoop and it needs to be handled. That is what the javadoc is pointing to. I believe the mapping of the className to the errorInfo is unique. As, it doesn't make sense to have something like registerError(ABC.class.getName(), ActionExecutorException.ErrorType.NON_TRANSIENT, "JA004" ); registerError(ABC.class.getName(), ActionExecutorException.ErrorType.TRANSIENT, "JA007" ); Even if there are multiple implementations of same class, the class name would be different. So, no need of using Set of error infos. I would prefer that we go back to the same earlier approach that was working before even though it requires exception to be registered in correct order.
          Hide
          rkanter Robert Kanter added a comment -

          I see; that makes sense. I'll remove the Set thing and upload a new patch.
          I'm going to first try the approach I suggested in my previous comment; if you still think it would be better to go back to the old approach, I can go back to that way.

          Show
          rkanter Robert Kanter added a comment - I see; that makes sense. I'll remove the Set thing and upload a new patch. I'm going to first try the approach I suggested in my previous comment; if you still think it would be better to go back to the old approach, I can go back to that way.
          Hide
          rkanter Robert Kanter added a comment -

          The new patch gets rid of the Set stuff.

          I kept it using the class name because:
          1) The order that the exceptions are registered in doesn't matter as much
          2) If the exception is registered, its more efficient because we can directly look up in the ERROR_INFOS instead of iterating through it.

          Let me know if you think it would be better to go back to the old way and I can make a new patch.

          Show
          rkanter Robert Kanter added a comment - The new patch gets rid of the Set stuff. I kept it using the class name because: 1) The order that the exceptions are registered in doesn't matter as much 2) If the exception is registered, its more efficient because we can directly look up in the ERROR_INFOS instead of iterating through it. Let me know if you think it would be better to go back to the old way and I can make a new patch.
          Hide
          hadoopqa Hadoop QA added a comment -

          Testing JIRA OOZIE-994

          Cleaning local svn workspace

          ----------------------------
          
          +1 PATCH_APPLIES
             CLEAN cleaned target directories
          +1 RAW_PATCH_ANALYSIS
              +1 the patch does not introduce any @author tags
              +1 the patch does not introduce any tabs
              +1 the patch does not introduce any trailing spaces
              +1 the patch does not introduce any line longer than 132
              +1 the patch does adds/modifies 3 testcase(s)
          +1 RAT
              +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
              +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
              +1 HEAD compiles
              +1 patch compiles
              +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
              +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
              +1 the patch does not modify JPA files
          +1 TESTS
             Tests run: 912
             Tests failures: 0
             Tests errors: 0
          +1 DISTRO
              +1 distro tarball builds with the patch 
          
          ----------------------------
          

          The full output of the test-patch run is available at

          https://builds.apache.org/job/oozie-trunk-precommit-build/113/

          Show
          hadoopqa Hadoop QA added a comment - Testing JIRA OOZIE-994 Cleaning local svn workspace ---------------------------- +1 PATCH_APPLIES CLEAN cleaned target directories +1 RAW_PATCH_ANALYSIS +1 the patch does not introduce any @author tags +1 the patch does not introduce any tabs +1 the patch does not introduce any trailing spaces +1 the patch does not introduce any line longer than 132 +1 the patch does adds/modifies 3 testcase(s) +1 RAT +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE +1 HEAD compiles +1 patch compiles +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations +1 the patch does not modify JPA files +1 TESTS Tests run: 912 Tests failures: 0 Tests errors: 0 +1 DISTRO +1 distro tarball builds with the patch ---------------------------- The full output of the test-patch run is available at https://builds.apache.org/job/oozie-trunk-precommit-build/113/
          Hide
          virag Virag Kothari added a comment -

          This patch looks good. It is similar to your original patch with two for loops. I am fine with either of them. It seems like it is non-trivial to get rid of one loop and also not rely on exception ordering.
          If you are going ahead with this one, some minor comments:

          • Remove unused imports (HashSEt, set)
          • The new parameter 'Class' should be parameterized (Class<?>)
          • The two input parameters to convertExceptionHelper seems dependent on each other. Probably you can pass one and let the other one derive from it.
          • Remove @SuppressWarnings("deprecation") from patch in TestActionCheckX

          +1 after addressing comments

          Show
          virag Virag Kothari added a comment - This patch looks good. It is similar to your original patch with two for loops. I am fine with either of them. It seems like it is non-trivial to get rid of one loop and also not rely on exception ordering. If you are going ahead with this one, some minor comments: Remove unused imports (HashSEt, set) The new parameter 'Class' should be parameterized (Class<?>) The two input parameters to convertExceptionHelper seems dependent on each other. Probably you can pass one and let the other one derive from it. Remove @SuppressWarnings("deprecation") from patch in TestActionCheckX +1 after addressing comments
          Hide
          rkanter Robert Kanter added a comment -

          New patch addresses all of your comments

          Show
          rkanter Robert Kanter added a comment - New patch addresses all of your comments
          Hide
          hadoopqa Hadoop QA added a comment -

          Testing JIRA OOZIE-994

          Cleaning local svn workspace

          ----------------------------
          
          +1 PATCH_APPLIES
             CLEAN cleaned target directories
          +1 RAW_PATCH_ANALYSIS
              +1 the patch does not introduce any @author tags
              +1 the patch does not introduce any tabs
              +1 the patch does not introduce any trailing spaces
              +1 the patch does not introduce any line longer than 132
              +1 the patch does adds/modifies 3 testcase(s)
          +1 RAT
              +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
              +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
              +1 HEAD compiles
              +1 patch compiles
              +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
              +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
              +1 the patch does not modify JPA files
          +1 TESTS
             Tests run: 912
             Tests failures: 0
             Tests errors: 0
          +1 DISTRO
              +1 distro tarball builds with the patch 
          
          ----------------------------
          

          The full output of the test-patch run is available at

          https://builds.apache.org/job/oozie-trunk-precommit-build/115/

          Show
          hadoopqa Hadoop QA added a comment - Testing JIRA OOZIE-994 Cleaning local svn workspace ---------------------------- +1 PATCH_APPLIES CLEAN cleaned target directories +1 RAW_PATCH_ANALYSIS +1 the patch does not introduce any @author tags +1 the patch does not introduce any tabs +1 the patch does not introduce any trailing spaces +1 the patch does not introduce any line longer than 132 +1 the patch does adds/modifies 3 testcase(s) +1 RAT +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE +1 HEAD compiles +1 patch compiles +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations +1 the patch does not modify JPA files +1 TESTS Tests run: 912 Tests failures: 0 Tests errors: 0 +1 DISTRO +1 distro tarball builds with the patch ---------------------------- The full output of the test-patch run is available at https://builds.apache.org/job/oozie-trunk-precommit-build/115/
          Hide
          virag Virag Kothari added a comment -

          Committed to trunk. Thanks Robert

          Show
          virag Virag Kothari added a comment - Committed to trunk. Thanks Robert
          Hide
          rohini Rohini Palaniswamy added a comment -

          Can this be ported to 3.3? We are hitting this issue very hard with RM restarts.

          Show
          rohini Rohini Palaniswamy added a comment - Can this be ported to 3.3? We are hitting this issue very hard with RM restarts.
          Hide
          tucu00 Alejandro Abdelnur added a comment -

          Closing issue as it has been released as part of 3.3.0 release

          Show
          tucu00 Alejandro Abdelnur added a comment - Closing issue as it has been released as part of 3.3.0 release

            People

            • Assignee:
              rkanter Robert Kanter
              Reporter:
              tucu00 Alejandro Abdelnur
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development