Oozie
  1. Oozie
  2. OOZIE-1319

"LAST_ONLY" in execution control for coordinator job still runs all the actions

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1.0
    • Component/s: None
    • Labels:
      None

      Description

      In execute() of CoordJobGetReadyActionsJPAExecutor.java, once we retrieve the top item from a "LIFO" query result, we do not discard or delete the remaining items from the result list. As a result, the next time execute() is invoked, we will be retrieving the next item in line. Consequently, LAST_ONLY strategy will also execute all ready actions for a given coordinator job, making it no different than LIFO.

      1. oozie-1319.patch
        9 kB
        Bowen Zhang
      2. OOZIE-1319.patch
        55 kB
        Robert Kanter
      3. OOZIE-1319.patch
        61 kB
        Robert Kanter
      4. OOZIE-1319.patch
        59 kB
        Robert Kanter
      5. OOZIE-1319.patch
        59 kB
        Robert Kanter
      6. OOZIE-1319.patch
        34 kB
        Robert Kanter
      7. OOZIE-1319.patch
        35 kB
        Robert Kanter
      8. OOZIE-1319.patch
        34 kB
        Robert Kanter

        Issue Links

          Activity

          Bowen Zhang created issue -
          Bowen Zhang made changes -
          Field Original Value New Value
          Description In execute() of CoordJobGetReadyActionsJPAExecutor.java, once we retrieve the top item from a "LIFO" query result, we do not discard or delete the remaining items from the result list. As a result, the next time execute() is invoked, we will be retrieving the next item in line. Consequently, LAST_ONLY strategy will also execute all ready actions for a given coordinator job, making it no different than LIFO.
          Hide
          Alejandro Abdelnur added a comment -

          it seems like bug

          Show
          Alejandro Abdelnur added a comment - it seems like bug
          Hide
          Bowen Zhang added a comment -

          Alejandro Abdelnur, what should be the status of the actions we skipped using LAST_ONLY execution? Should we introduce a new status called "SKIPPED"? or do we use "KILLED".

          Show
          Bowen Zhang added a comment - Alejandro Abdelnur , what should be the status of the actions we skipped using LAST_ONLY execution? Should we introduce a new status called "SKIPPED"? or do we use "KILLED".
          Bowen Zhang made changes -
          Assignee Bowen Zhang [ bowenzhangusa ]
          Hide
          Alejandro Abdelnur added a comment -

          It seems that a new 'SKIPPED' status makes more sense. We have to make sure this does not break things.

          Show
          Alejandro Abdelnur added a comment - It seems that a new 'SKIPPED' status makes more sense. We have to make sure this does not break things.
          Bowen Zhang made changes -
          Attachment oozie-1319.patch [ 12584183 ]
          Hide
          Alejandro Abdelnur added a comment -

          Apologies for the delay. patch LGTM. One more thing we may have to take care is in the purge logic, can you please check this?

          Any reason why it has not been marked as 'submit patch' for test-patch to run?

          Show
          Alejandro Abdelnur added a comment - Apologies for the delay. patch LGTM. One more thing we may have to take care is in the purge logic, can you please check this? Any reason why it has not been marked as 'submit patch' for test-patch to run?
          Hide
          Bowen Zhang added a comment -

          Alejandro Abdelnur, I will take a look at the purge logic.

          Show
          Bowen Zhang added a comment - Alejandro Abdelnur , I will take a look at the purge logic.
          Hide
          Robert Kanter added a comment -

          I don't think this should affect the purge logic. I just took a quick look and it doesn't search for coordinator actions; the way it deletes coordinator actions is by searching for the underlying workflows and deleting all of the coordinator actions when we delete the coordinator job (after verifying some other stuff of course).

          Show
          Robert Kanter added a comment - I don't think this should affect the purge logic. I just took a quick look and it doesn't search for coordinator actions; the way it deletes coordinator actions is by searching for the underlying workflows and deleting all of the coordinator actions when we delete the coordinator job (after verifying some other stuff of course).
          Bowen Zhang made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Hadoop QA added a comment -

          Testing JIRA OOZIE-1319

          Cleaning local svn workspace

          ----------------------------

          +1 PATCH_APPLIES
          +1 CLEAN
          -1 RAW_PATCH_ANALYSIS
          . +1 the patch does not introduce any @author tags
          . +1 the patch does not introduce any tabs
          . +1 the patch does not introduce any trailing spaces
          . -1 the patch contains 3 line(s) longer than 132 characters
          . +1 the patch does adds/modifies 2 testcase(s)
          +1 RAT
          . +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
          . +1 the patch does not seem to introduce new Javadoc warnings
          . WARNING: the current HEAD has 1 Javadoc warning(s)
          -1 COMPILE
          . +1 HEAD compiles
          . +1 patch compiles
          . -1 the patch seems to introduce 1 new javac warning(s)
          +1 BACKWARDS_COMPATIBILITY
          . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
          . +1 the patch does not modify JPA files
          -1 TESTS
          . Tests run: 1240
          . Tests failed: 0
          . Tests errors: 1109

          . The patch failed the following testcases:

          .

          +1 DISTRO
          . +1 distro tarball builds with the patch

          ----------------------------
          -1 Overall result, please check the reported -1(s)

          . There is at least one warning, please check

          The full output of the test-patch run is available at

          . https://builds.apache.org/job/oozie-trunk-precommit-build/546/

          Show
          Hadoop QA added a comment - Testing JIRA OOZIE-1319 Cleaning local svn workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN -1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . +1 the patch does not introduce any trailing spaces . -1 the patch contains 3 line(s) longer than 132 characters . +1 the patch does adds/modifies 2 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings . WARNING : the current HEAD has 1 Javadoc warning(s) -1 COMPILE . +1 HEAD compiles . +1 patch compiles . -1 the patch seems to introduce 1 new javac warning(s) +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS . Tests run: 1240 . Tests failed: 0 . Tests errors: 1109 . The patch failed the following testcases: . +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) . There is at least one warning, please check The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/546/
          Hide
          Bowen Zhang added a comment -

          I don't think the test failures are related to my changes.

          Show
          Bowen Zhang added a comment - I don't think the test failures are related to my changes.
          Hide
          Robert Kanter added a comment -

          Probably not; can you reupload the patch to run the tests again? It's doing that thing were it can't lock some directory for the mini dfs cluster; hopefully its fixed itself by now.

          Show
          Robert Kanter added a comment - Probably not; can you reupload the patch to run the tests again? It's doing that thing were it can't lock some directory for the mini dfs cluster; hopefully its fixed itself by now.
          Hide
          Alejandro Abdelnur added a comment -

          no need to reupload, i've kicked the build for this patch manually.

          Show
          Alejandro Abdelnur added a comment - no need to reupload, i've kicked the build for this patch manually.
          Bowen Zhang made changes -
          Attachment oozie-1319.patch [ 12584183 ]
          Bowen Zhang made changes -
          Attachment oozie-1319.patch [ 12585670 ]
          Hide
          Alejandro Abdelnur added a comment -

          +1, pending jenkins, kicking off jenkins manually.

          Show
          Alejandro Abdelnur added a comment - +1, pending jenkins, kicking off jenkins manually.
          Hide
          Hadoop QA added a comment -

          Testing JIRA OOZIE-1319

          Cleaning local svn workspace

          ----------------------------

          +1 PATCH_APPLIES
          +1 CLEAN
          -1 RAW_PATCH_ANALYSIS
          . +1 the patch does not introduce any @author tags
          . +1 the patch does not introduce any tabs
          . +1 the patch does not introduce any trailing spaces
          . -1 the patch contains 3 line(s) longer than 132 characters
          . +1 the patch does adds/modifies 2 testcase(s)
          +1 RAT
          . +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
          . +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
          . +1 HEAD compiles
          . +1 patch compiles
          . +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
          . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
          . +1 the patch does not modify JPA files
          -1 TESTS
          . Tests run: 1247
          . Tests failed: 1
          . Tests errors: 0

          . The patch failed the following testcases:

          . testCoordinatorActionEvent(org.apache.oozie.event.TestEventGeneration)

          +1 DISTRO
          . +1 distro tarball builds with the patch

          ----------------------------
          -1 Overall result, please check the reported -1(s)

          The full output of the test-patch run is available at

          . https://builds.apache.org/job/oozie-trunk-precommit-build/584/

          Show
          Hadoop QA added a comment - Testing JIRA OOZIE-1319 Cleaning local svn workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN -1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . +1 the patch does not introduce any trailing spaces . -1 the patch contains 3 line(s) longer than 132 characters . +1 the patch does adds/modifies 2 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE . +1 HEAD compiles . +1 patch compiles . +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS . Tests run: 1247 . Tests failed: 1 . Tests errors: 0 . The patch failed the following testcases: . testCoordinatorActionEvent(org.apache.oozie.event.TestEventGeneration) +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/584/
          Hide
          Hadoop QA added a comment -

          Testing JIRA OOZIE-1319

          Cleaning local svn workspace

          ----------------------------

          +1 PATCH_APPLIES
          +1 CLEAN
          -1 RAW_PATCH_ANALYSIS
          . +1 the patch does not introduce any @author tags
          . +1 the patch does not introduce any tabs
          . +1 the patch does not introduce any trailing spaces
          . -1 the patch contains 3 line(s) longer than 132 characters
          . +1 the patch does adds/modifies 2 testcase(s)
          +1 RAT
          . +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
          . +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
          . +1 HEAD compiles
          . +1 patch compiles
          . +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
          . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
          . +1 the patch does not modify JPA files
          -1 TESTS
          . Tests run: 1248
          . Tests failed: 1
          . Tests errors: 0

          . The patch failed the following testcases:

          . testCoordinatorActionEvent(org.apache.oozie.event.TestEventGeneration)

          +1 DISTRO
          . +1 distro tarball builds with the patch

          ----------------------------
          -1 Overall result, please check the reported -1(s)

          The full output of the test-patch run is available at

          . https://builds.apache.org/job/oozie-trunk-precommit-build/595/

          Show
          Hadoop QA added a comment - Testing JIRA OOZIE-1319 Cleaning local svn workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN -1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . +1 the patch does not introduce any trailing spaces . -1 the patch contains 3 line(s) longer than 132 characters . +1 the patch does adds/modifies 2 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE . +1 HEAD compiles . +1 patch compiles . +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS . Tests run: 1248 . Tests failed: 1 . Tests errors: 0 . The patch failed the following testcases: . testCoordinatorActionEvent(org.apache.oozie.event.TestEventGeneration) +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/595/
          Hide
          Alejandro Abdelnur added a comment -

          Bowen, could you please check if the failure is related or not?

          Show
          Alejandro Abdelnur added a comment - Bowen, could you please check if the failure is related or not?
          Hide
          Robert Kanter added a comment -

          I tried out the patch by modifying the aggregator example to be LAST_ONLY and removing the input data from HDFS. Once both actions were WAITING, I uploaded the data back into HDFS. I was expecting the first (@1) action to go WAITING --> READY --> SKIPPED and the second (@2) action to go WAITING --> READY --> RUNNING --> SUCCEEDED. However, they both went WAITING --> READY --> RUNNING --> SUCCEEDED; though they did it in LIFO order (that is, @2 then @1). So, it doesn't look like this is working properly. Looking at the code, I think it looks like it should work. Bowen Zhang, any ideas?

          Also, so I don't forget, the patch should also update the documentation to mention the SKIPPED status here:
          https://oozie.apache.org/docs/4.0.1/CoordinatorFunctionalSpec.html#a6.1.3.2._Coordinator_Action_Status
          and also where it talks about LAST_ONLY here:
          https://oozie.apache.org/docs/4.0.1/CoordinatorFunctionalSpec.html#a6.3._Synchronous_Coordinator_Application_Definition

          Show
          Robert Kanter added a comment - I tried out the patch by modifying the aggregator example to be LAST_ONLY and removing the input data from HDFS. Once both actions were WAITING, I uploaded the data back into HDFS. I was expecting the first (@1) action to go WAITING --> READY --> SKIPPED and the second (@2) action to go WAITING --> READY --> RUNNING --> SUCCEEDED. However, they both went WAITING --> READY --> RUNNING --> SUCCEEDED; though they did it in LIFO order (that is, @2 then @1). So, it doesn't look like this is working properly. Looking at the code, I think it looks like it should work. Bowen Zhang , any ideas? Also, so I don't forget, the patch should also update the documentation to mention the SKIPPED status here: https://oozie.apache.org/docs/4.0.1/CoordinatorFunctionalSpec.html#a6.1.3.2._Coordinator_Action_Status and also where it talks about LAST_ONLY here: https://oozie.apache.org/docs/4.0.1/CoordinatorFunctionalSpec.html#a6.3._Synchronous_Coordinator_Application_Definition
          Hide
          Bowen Zhang added a comment -

          I think the concurrency also need to be 2 if you get two actions in waiting. Overall, this execution control is not really a well-defined property.

          Show
          Bowen Zhang added a comment - I think the concurrency also need to be 2 if you get two actions in waiting. Overall, this execution control is not really a well-defined property.
          Hide
          Robert Kanter added a comment -

          I think the concurrency also need to be 2 if you get two actions in waiting.

          That just made both actions do RUNNING at the same time instead of in sequence

          I think the problem is that the query used to get the Jobs is this (for FIFO; FILO is just this with the order as "desc"):

          select a.id, a.jobId, a.statusStr, a.pending, a.nominalTimestamp, a.createdTimestamp from CoordinatorActionBean a where a.jobId = :jobId AND a.statusStr = 'READY' order by a.nominalTimestamp
          

          Even though I made the data available for both actions at the same time, Oozie didn't transition them to READY at the same time, so when this query was executed, there was only ever 1 action in READY – that's why it didn't seem to do anything. I was thinking that this may depend partly on the timing of the actions transitioning from WAITING to READY, which doesn't happen simultaneously for all actions.
          I tried modifying the example to have 4 actions; I then waited for all of them to be WAITING and put up the data; the @4 went to RUNNING and the rest eventually went to READY. I was then expecting those to go to SKIPPED because the query should be looking for them; however, they stayed READY and just went in sequence to RUNNING and SUCCEEDED.

          tldr; the patch isn't working

          Overall, this execution control is not really a well-defined property.

          My understanding of LAST_ONLY is that if multiple actions are WAITING or READY, only one of them should actually run while the rest are skipped. What's not clear to me is if the "LAST" means the oldest action (i.e. the last one in the list) or the newest action (i.e. the last one to be added); though this isn't too hard to switch around once we get the rest of this working. It may make sense to add a "FIRST_ONLY" for the opposite or something.

          Bowen Zhang, are you planning on still doing this? Or do you mind if I take a stab at it?

          Show
          Robert Kanter added a comment - I think the concurrency also need to be 2 if you get two actions in waiting. That just made both actions do RUNNING at the same time instead of in sequence I think the problem is that the query used to get the Jobs is this (for FIFO; FILO is just this with the order as "desc"): select a.id, a.jobId, a.statusStr, a.pending, a.nominalTimestamp, a.createdTimestamp from CoordinatorActionBean a where a.jobId = :jobId AND a.statusStr = 'READY' order by a.nominalTimestamp Even though I made the data available for both actions at the same time, Oozie didn't transition them to READY at the same time, so when this query was executed, there was only ever 1 action in READY – that's why it didn't seem to do anything. I was thinking that this may depend partly on the timing of the actions transitioning from WAITING to READY, which doesn't happen simultaneously for all actions. I tried modifying the example to have 4 actions; I then waited for all of them to be WAITING and put up the data; the @4 went to RUNNING and the rest eventually went to READY. I was then expecting those to go to SKIPPED because the query should be looking for them; however, they stayed READY and just went in sequence to RUNNING and SUCCEEDED. tldr; the patch isn't working Overall, this execution control is not really a well-defined property. My understanding of LAST_ONLY is that if multiple actions are WAITING or READY, only one of them should actually run while the rest are skipped. What's not clear to me is if the "LAST" means the oldest action (i.e. the last one in the list) or the newest action (i.e. the last one to be added); though this isn't too hard to switch around once we get the rest of this working. It may make sense to add a "FIRST_ONLY" for the opposite or something. Bowen Zhang , are you planning on still doing this? Or do you mind if I take a stab at it?
          Hide
          Bowen Zhang added a comment -

          go ahead and take it. But, don't try too hard since I haven't encountered any customers using this property.

          Show
          Bowen Zhang added a comment - go ahead and take it. But, don't try too hard since I haven't encountered any customers using this property.
          Hide
          Shaik Idris Ali added a comment -

          We have some usecases related to realtime data pipeline with frequency around minutes(5), this feature is very handy specially when we have huge backlog of workflow instances, wherein we just pick the latest action and discard the rest.

          Show
          Shaik Idris Ali added a comment - We have some usecases related to realtime data pipeline with frequency around minutes(5), this feature is very handy specially when we have huge backlog of workflow instances, wherein we just pick the latest action and discard the rest.
          Robert Kanter made changes -
          Assignee Bowen Zhang [ bowenzhangusa ] Robert Kanter [ rkanter ]
          Hide
          Shwetha G S added a comment -

          Does purge delete the old skipped coord actions? Please make sure that they are deleted, else it will just fill up the DB

          Show
          Shwetha G S added a comment - Does purge delete the old skipped coord actions? Please make sure that they are deleted, else it will just fill up the DB
          Hide
          Robert Kanter added a comment -

          I'm still working on the patch, but it should delete the SKIPPED actions because when looking for entries to delete, it only looks at the jobs; it then deletes all actions belonging to the job without looking at their status. In other words, the actions' statuses don't actually matter.

          Show
          Robert Kanter added a comment - I'm still working on the patch, but it should delete the SKIPPED actions because when looking for entries to delete, it only looks at the jobs; it then deletes all actions belonging to the job without looking at their status. In other words, the actions' statuses don't actually matter.
          Robert Kanter made changes -
          Link This issue is duplicated by OOZIE-614 [ OOZIE-614 ]
          Hide
          Robert Kanter added a comment -

          I've had to re-work the approach that the original patch was taking. In the original patch, it only looks at the currently READY actions; we actually want to skip any unscheduled actions (i.e. READY or WAITING). So, when I was trying it out, what was happening is that the actions didn't all become READY at the same time, so it wasn't skipping any.

          The new patch keeps the SKIPPED status stuff from the original patch, but we now check when we materialize actions. Basically, whenever we materialize action(s) we want to skip all but the latest unscheduled action.

          Show
          Robert Kanter added a comment - I've had to re-work the approach that the original patch was taking. In the original patch, it only looks at the currently READY actions; we actually want to skip any unscheduled actions (i.e. READY or WAITING). So, when I was trying it out, what was happening is that the actions didn't all become READY at the same time, so it wasn't skipping any. The new patch keeps the SKIPPED status stuff from the original patch, but we now check when we materialize actions. Basically, whenever we materialize action(s) we want to skip all but the latest unscheduled action.
          Robert Kanter made changes -
          Attachment OOZIE-1319.patch [ 12640574 ]
          Hide
          Hadoop QA added a comment -

          Testing JIRA OOZIE-1319

          Cleaning local git workspace

          ----------------------------

          -1 Patch failed to apply to head of branch

          ----------------------------

          Show
          Hadoop QA added a comment - Testing JIRA OOZIE-1319 Cleaning local git workspace ---------------------------- -1 Patch failed to apply to head of branch ----------------------------
          Hide
          Robert Kanter added a comment -

          Rebased patch on latest trunk

          Show
          Robert Kanter added a comment - Rebased patch on latest trunk
          Robert Kanter made changes -
          Attachment OOZIE-1319.patch [ 12640579 ]
          Hide
          Hadoop QA added a comment -

          Testing JIRA OOZIE-1319

          Cleaning local git workspace

          ----------------------------

          +1 PATCH_APPLIES
          +1 CLEAN
          -1 RAW_PATCH_ANALYSIS
          . +1 the patch does not introduce any @author tags
          . +1 the patch does not introduce any tabs
          . +1 the patch does not introduce any trailing spaces
          . -1 the patch contains 4 line(s) longer than 132 characters
          . +1 the patch does adds/modifies 4 testcase(s)
          +1 RAT
          . +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
          . +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
          . +1 HEAD compiles
          . +1 patch compiles
          . +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
          . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
          . +1 the patch does not modify JPA files
          -1 TESTS
          . Tests run: 1422
          . Tests failed: 0
          . Tests errors: 1

          . The patch failed the following testcases:

          .

          +1 DISTRO
          . +1 distro tarball builds with the patch

          ----------------------------
          -1 Overall result, please check the reported -1(s)

          The full output of the test-patch run is available at

          . https://builds.apache.org/job/oozie-trunk-precommit-build/1176/

          Show
          Hadoop QA added a comment - Testing JIRA OOZIE-1319 Cleaning local git workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN -1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . +1 the patch does not introduce any trailing spaces . -1 the patch contains 4 line(s) longer than 132 characters . +1 the patch does adds/modifies 4 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE . +1 HEAD compiles . +1 patch compiles . +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS . Tests run: 1422 . Tests failed: 0 . Tests errors: 1 . The patch failed the following testcases: . +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/1176/
          Hide
          Robert Kanter added a comment -

          The 4 lines that are too long are for namedQueries and twiki formatting. Test failure is unrelated.

          Show
          Robert Kanter added a comment - The 4 lines that are too long are for namedQueries and twiki formatting. Test failure is unrelated.
          Hide
          Shwetha G S added a comment -

          I'm still working on the patch, but it should delete the SKIPPED actions because when looking for entries to delete, it only looks at the jobs; it then deletes all actions belonging to the job without looking at their status. In other words, the actions' statuses don't actually matter.

          Since skipped coord actions will not have any workflow jobs, just want to make sure this is handled. Great if its already taken care. Thanks

          Show
          Shwetha G S added a comment - I'm still working on the patch, but it should delete the SKIPPED actions because when looking for entries to delete, it only looks at the jobs; it then deletes all actions belonging to the job without looking at their status. In other words, the actions' statuses don't actually matter. Since skipped coord actions will not have any workflow jobs, just want to make sure this is handled. Great if its already taken care. Thanks
          Hide
          Robert Kanter added a comment -

          Since skipped coord actions will not have any workflow jobs, just want to make sure this is handled. Great if its already taken care. Thanks

          Yup, actions are deleted when their associated jobs are deleted. For example, all Coordinator actions, regardless of status or anything else, are deleted when their associated Coordinator Job is deleted.

          Show
          Robert Kanter added a comment - Since skipped coord actions will not have any workflow jobs, just want to make sure this is handled. Great if its already taken care. Thanks Yup, actions are deleted when their associated jobs are deleted. For example, all Coordinator actions, regardless of status or anything else, are deleted when their associated Coordinator Job is deleted.
          Hide
          Robert Kanter added a comment -

          Rebased on latest trunk

          Show
          Robert Kanter added a comment - Rebased on latest trunk
          Robert Kanter made changes -
          Attachment OOZIE-1319.patch [ 12641776 ]
          Hide
          Hadoop QA added a comment -

          Testing JIRA OOZIE-1319

          Cleaning local git workspace

          ----------------------------

          +1 PATCH_APPLIES
          +1 CLEAN
          -1 RAW_PATCH_ANALYSIS
          . +1 the patch does not introduce any @author tags
          . +1 the patch does not introduce any tabs
          . +1 the patch does not introduce any trailing spaces
          . -1 the patch contains 4 line(s) longer than 132 characters
          . +1 the patch does adds/modifies 4 testcase(s)
          +1 RAT
          . +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
          . +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
          . +1 HEAD compiles
          . +1 patch compiles
          . +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
          . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
          . +1 the patch does not modify JPA files
          -1 TESTS
          . Tests run: 1426
          . Tests failed: 3
          . Tests errors: 4

          . The patch failed the following testcases:

          . testConcurrencyReachedAndChooseNextEligible(org.apache.oozie.service.TestCallableQueueService)
          . testTimeOutWithUnresolvedMissingDependencies(org.apache.oozie.command.coord.TestCoordPushDependencyCheckXCommand)
          . testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration)

          +1 DISTRO
          . +1 distro tarball builds with the patch

          ----------------------------
          -1 Overall result, please check the reported -1(s)

          The full output of the test-patch run is available at

          . https://builds.apache.org/job/oozie-trunk-precommit-build/1197/

          Show
          Hadoop QA added a comment - Testing JIRA OOZIE-1319 Cleaning local git workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN -1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . +1 the patch does not introduce any trailing spaces . -1 the patch contains 4 line(s) longer than 132 characters . +1 the patch does adds/modifies 4 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE . +1 HEAD compiles . +1 patch compiles . +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS . Tests run: 1426 . Tests failed: 3 . Tests errors: 4 . The patch failed the following testcases: . testConcurrencyReachedAndChooseNextEligible(org.apache.oozie.service.TestCallableQueueService) . testTimeOutWithUnresolvedMissingDependencies(org.apache.oozie.command.coord.TestCoordPushDependencyCheckXCommand) . testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration) +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/1197/
          Hide
          Bowen Zhang added a comment -

          By spotting your patch, I see two things.

          • By definition, "execution order" means "the execution order if multiple instances of the coordinator job have satisfied their execution criteria" which means instances that fulfill both data and time dependency and ready to run. You added the "handleLastOnly()" in CoordMaterializeTransitionXCommand which changes the logic of regular materialization. After your changes, any coordinator job with "LAST_ONLY" execution will only execute the last action in EACH MATERIALIZATION WINDOW. For example, suppose right now is 9am and I submit a coordinator job with "LAST_ONLY" and 10 minutes frequency which starts at 10 am and ends at 1pm. As a result, among all the 18 actions, only actions with nominal time of 10:50am, 11:50am, and 12:50pm will run. All the others will be SKIPPED. That is not correct behavior.
          • We should not create JPAExecutor class anymore. Your query name "GET_UNSCHEDULED_ACTIONS" should be put into CoordActionQueryExecutor.java.
          Show
          Bowen Zhang added a comment - By spotting your patch, I see two things. By definition, "execution order" means "the execution order if multiple instances of the coordinator job have satisfied their execution criteria" which means instances that fulfill both data and time dependency and ready to run. You added the "handleLastOnly()" in CoordMaterializeTransitionXCommand which changes the logic of regular materialization. After your changes, any coordinator job with "LAST_ONLY" execution will only execute the last action in EACH MATERIALIZATION WINDOW. For example, suppose right now is 9am and I submit a coordinator job with "LAST_ONLY" and 10 minutes frequency which starts at 10 am and ends at 1pm. As a result, among all the 18 actions, only actions with nominal time of 10:50am, 11:50am, and 12:50pm will run. All the others will be SKIPPED. That is not correct behavior. We should not create JPAExecutor class anymore. Your query name "GET_UNSCHEDULED_ACTIONS" should be put into CoordActionQueryExecutor.java.
          Hide
          Robert Kanter added a comment -

          I think this will work if I update handleLastOnly() to check if the nominal time for the action is in the past. If it's in the past, then Oozie is doing "catch up" and the current behavior of handleLastOnly() is correct. If it's in the future, then handleLastOnly() should do nothing to it (so the original behavior remains and it will be WAITING/READY). Does this sound right?

          I'll get rid of the JPAExecutor in the next patch.

          Show
          Robert Kanter added a comment - I think this will work if I update handleLastOnly() to check if the nominal time for the action is in the past. If it's in the past, then Oozie is doing "catch up" and the current behavior of handleLastOnly() is correct. If it's in the future, then handleLastOnly() should do nothing to it (so the original behavior remains and it will be WAITING/READY). Does this sound right? I'll get rid of the JPAExecutor in the next patch.
          Hide
          Bowen Zhang added a comment -

          The right logic change should really happen in CoordActionReadyXCommand instead of materialization stage. Your new approach will not work in the following two scenarios:

          • If a coordinator job starts at 9am and ends at 12pm with frequency of 10 minutes. And oozie server is back up at 11am. It is supposed to only run the action at 10:50 am and forward. However, your new code will run the action at 10:50am AS WELL AS the action at 9:50am.
          • Same coordinator job with a data dependency and oozie server is always up. The job starts at 9am, but the data dependency fulfills only at 11am. ALL actions will run since they are all marked as "WAITING" and no logic in CoordActionREadyXCommand.
          Show
          Bowen Zhang added a comment - The right logic change should really happen in CoordActionReadyXCommand instead of materialization stage. Your new approach will not work in the following two scenarios: If a coordinator job starts at 9am and ends at 12pm with frequency of 10 minutes. And oozie server is back up at 11am. It is supposed to only run the action at 10:50 am and forward. However, your new code will run the action at 10:50am AS WELL AS the action at 9:50am. Same coordinator job with a data dependency and oozie server is always up. The job starts at 9am, but the data dependency fulfills only at 11am. ALL actions will run since they are all marked as "WAITING" and no logic in CoordActionREadyXCommand.
          Hide
          Robert Kanter added a comment -

          The second part of handleLastOnly() actually retrieves all READY or WAITING actions already created and sets them to SKIPPED. In other words, when materializing actions, if it sees any previously materialized actions READY/WAITING, they'll be set to SKIPPED. So in those two scenarios, I believe it will actually work correctly.

          The only time we need to set an action to SKIPPED is if its READY/WAITING and a newer action gets materialized. So I think putting the code in CoordMaterializeTransitionXCommand makes sense with this in mind. I think the issue now is that it doesn't take into account if we're materializing future actions or not, in which case, it needs to not do this behavior.

          Show
          Robert Kanter added a comment - The second part of handleLastOnly() actually retrieves all READY or WAITING actions already created and sets them to SKIPPED. In other words, when materializing actions, if it sees any previously materialized actions READY/WAITING, they'll be set to SKIPPED. So in those two scenarios, I believe it will actually work correctly. The only time we need to set an action to SKIPPED is if its READY/WAITING and a newer action gets materialized. So I think putting the code in CoordMaterializeTransitionXCommand makes sense with this in mind. I think the issue now is that it doesn't take into account if we're materializing future actions or not, in which case, it needs to not do this behavior.
          Hide
          Bowen Zhang added a comment -

          Neither of the two scenarios are guaranteed to work.

          • The first scenario,

            In other words, when materializing actions, if it sees any previously materialized actions READY/WAITING, they'll be set to SKIPPED.

            After oozie server is back up, when you run the second materialization, previously materialized actions may already be in RUNNING state or even "SUCCEEDED" state, therefore your "get_unscheduled_actions" query won't retrieve them.
          • Second scenario, what if the data dependency is fulfilled at 10:30am? All the WAITING actions will be in "READY" state. You need to wait till 10:55am to kick off your materialization logic.
          • Third point, your "GET_UNSCHEDULED_ACTIONS" query is retrieving all actions regardless the execution order of the parent coord jobs.
          Show
          Bowen Zhang added a comment - Neither of the two scenarios are guaranteed to work. The first scenario, In other words, when materializing actions, if it sees any previously materialized actions READY/WAITING, they'll be set to SKIPPED. After oozie server is back up, when you run the second materialization, previously materialized actions may already be in RUNNING state or even "SUCCEEDED" state, therefore your "get_unscheduled_actions" query won't retrieve them. Second scenario, what if the data dependency is fulfilled at 10:30am? All the WAITING actions will be in "READY" state. You need to wait till 10:55am to kick off your materialization logic. Third point, your "GET_UNSCHEDULED_ACTIONS" query is retrieving all actions regardless the execution order of the parent coord jobs.
          Hide
          Robert Kanter added a comment -

          After oozie server is back up, when you run the second materialization, previously materialized actions may already be in RUNNING state or even "SUCCEEDED" state, therefore your "get_unscheduled_actions" query won't retrieve them.

          Isn't that the correct behavior? If the action is already running, then I think we should not kill it.

          In the second scenario, there would only be one WAITING action becuase the older ones should be SKIPPED. My understanding of LAST_ONLY, is that if there are any WAITING or READY actions in the past, they should all be SKIPPED except for the latest one. WAITING or READY actions in the future should be unaffected (though my patch doesn't do this yet).

          Third point, your "GET_UNSCHEDULED_ACTIONS" query is retrieving all actions regardless the execution order of the parent coord jobs.

          It only runs "GET_UNSCHEDULED_ACTIONS" if the coordinator job's execution strategy is LAST_ONLY; there's an if statement in handleLastOnly().

          Show
          Robert Kanter added a comment - After oozie server is back up, when you run the second materialization, previously materialized actions may already be in RUNNING state or even "SUCCEEDED" state, therefore your "get_unscheduled_actions" query won't retrieve them. Isn't that the correct behavior? If the action is already running, then I think we should not kill it. In the second scenario, there would only be one WAITING action becuase the older ones should be SKIPPED. My understanding of LAST_ONLY, is that if there are any WAITING or READY actions in the past, they should all be SKIPPED except for the latest one. WAITING or READY actions in the future should be unaffected (though my patch doesn't do this yet). Third point, your "GET_UNSCHEDULED_ACTIONS" query is retrieving all actions regardless the execution order of the parent coord jobs. It only runs "GET_UNSCHEDULED_ACTIONS" if the coordinator job's execution strategy is LAST_ONLY; there's an if statement in handleLastOnly() .
          Hide
          Bowen Zhang added a comment -

          Let me give a more detailed description of the two scenarios

          • same coordinator job. If oozie server is back up at 11am, CoordMaterializationService will run immediately and create 6 actions from 9am to 9:50am which the first 5 actions will be "SKIPPED". At the same time, RecoveryService is very likely to kick off the action at 9:50am and turn the action into "SUCCEEDED" even before your CoordMaterializationService runs again to materialize actions from 10am to 10:50am where your "GET_UNSCHEDULED_ACTIONS" won't be able to retrieve the action at 9:50am. As a result, both actions at 9:50am and 10:50am are "SUCCEEDED". This is not the correct behavior.
          • same coordinator job with a data dependency. oozie server is always up from the beginning. If data dependency is fulfilled at 10:35am, then by definition, the action at 10:30 am should run and all the previous ones should be "SKIPPED". But in your code, everything will run into "SUCCEEDED" since the next CoordMaterializationService will run and try to materialize new actions at 10:55 am. At 10:30 am, there are already 12 actions in "WAITING" state in the database since at 8:55 am and 9:55 am, actions ARE NOT marked "SKIPPED" during materialization stage.
          • It only runs "GET_UNSCHEDULED_ACTIONS" if the coordinator job's execution strategy is LAST_ONLY; there's an if statement in handleLastOnly()

            Not true. What if at the same time, another coordinator job with "LIFO" execution order is running? Will you "SKIPPED" all its actions?
          Show
          Bowen Zhang added a comment - Let me give a more detailed description of the two scenarios same coordinator job. If oozie server is back up at 11am, CoordMaterializationService will run immediately and create 6 actions from 9am to 9:50am which the first 5 actions will be "SKIPPED". At the same time, RecoveryService is very likely to kick off the action at 9:50am and turn the action into "SUCCEEDED" even before your CoordMaterializationService runs again to materialize actions from 10am to 10:50am where your "GET_UNSCHEDULED_ACTIONS" won't be able to retrieve the action at 9:50am. As a result, both actions at 9:50am and 10:50am are "SUCCEEDED". This is not the correct behavior. same coordinator job with a data dependency. oozie server is always up from the beginning. If data dependency is fulfilled at 10:35am, then by definition, the action at 10:30 am should run and all the previous ones should be "SKIPPED". But in your code, everything will run into "SUCCEEDED" since the next CoordMaterializationService will run and try to materialize new actions at 10:55 am. At 10:30 am, there are already 12 actions in "WAITING" state in the database since at 8:55 am and 9:55 am, actions ARE NOT marked "SKIPPED" during materialization stage. It only runs "GET_UNSCHEDULED_ACTIONS" if the coordinator job's execution strategy is LAST_ONLY; there's an if statement in handleLastOnly() Not true. What if at the same time, another coordinator job with "LIFO" execution order is running? Will you "SKIPPED" all its actions?
          Hide
          Bowen Zhang added a comment -

          correction for my second point. In the second scenario, actions from 9:00 am to 9:50 am will be correctly skipped, but actions from 10:00 am till 10:50 am will run to "SUCCEDDED".

          Show
          Bowen Zhang added a comment - correction for my second point. In the second scenario, actions from 9:00 am to 9:50 am will be correctly skipped, but actions from 10:00 am till 10:50 am will run to "SUCCEDDED".
          Hide
          Robert Kanter added a comment -

          First scenario: I understand now. You are correct and this doesn't work properly because of the windowing. It needs to take into account all of the backlogged actions.

          Second scenario: I understand now. The CoordMaterializationService doesn't run frequently enough against this coordinator job (and it's not supposed to) for what I did to work properly.

          Not true. What if at the same time, another coordinator job with "LIFO" execution order is running? Will you "SKIPPED" all its actions?

          I was being stupid here
          The query should take the coordinator job id as an argument. Good catch; I was only trying this with one coord job, so I didn't run into this being an issue.

          I think you might be right and I need to look into maybe doing something in CoordActionReadyXCommand or possibly both there and CoordMaterializeTransitionXCommand. Whatever it is, it definitely needs to take into account if a materialization is in the future or in the past because the behavior should be different. Materialized actions in the past should be SKIPPED (except for the latest one once we catch up or the job ends), while materialized actions in the future should not be SKIPPED because we haven't gotten there yet. And when materializing in the past, the materialization window makes things trickier.
          Do you agree?

          Show
          Robert Kanter added a comment - First scenario: I understand now. You are correct and this doesn't work properly because of the windowing. It needs to take into account all of the backlogged actions. Second scenario: I understand now. The CoordMaterializationService doesn't run frequently enough against this coordinator job (and it's not supposed to) for what I did to work properly. Not true. What if at the same time, another coordinator job with "LIFO" execution order is running? Will you "SKIPPED" all its actions? I was being stupid here The query should take the coordinator job id as an argument. Good catch; I was only trying this with one coord job, so I didn't run into this being an issue. I think you might be right and I need to look into maybe doing something in CoordActionReadyXCommand or possibly both there and CoordMaterializeTransitionXCommand. Whatever it is, it definitely needs to take into account if a materialization is in the future or in the past because the behavior should be different. Materialized actions in the past should be SKIPPED (except for the latest one once we catch up or the job ends), while materialized actions in the future should not be SKIPPED because we haven't gotten there yet. And when materializing in the past, the materialization window makes things trickier. Do you agree?
          Hide
          Robert Kanter added a comment -

          Here's a rough idea of what I'm currently thinking:

          1. When materializing actions in CoordMaterializeTransitionXCommand, if we're in the past and LAST_ONLY, then it needs to materialize all actions until "now" or the end of the job (whichever is earliest) and set all but the latest one to SKIPPED (so we ignore the window in this case). And when materializing more actions (at the next window), it should set any old WAITING ones to SKIPPED as well. This doesn't touch any actions materialized in the future.
          2. When getting READY actions in CoordActionReadyXCommand, only the latest READY action should become SUBMITTED; the rest should be SKIPPED (re-use LIFO query here)

          Just a note to clarify: even with LAST_ONLY it is possible for multiple jobs to be RUNNING at the same time. e.g. if an older action that used to be the latest is still RUNNING when new actions come along, then that older action and the latest of the newer actions could both be RUNNING.

          Show
          Robert Kanter added a comment - Here's a rough idea of what I'm currently thinking: When materializing actions in CoordMaterializeTransitionXCommand, if we're in the past and LAST_ONLY, then it needs to materialize all actions until "now" or the end of the job (whichever is earliest) and set all but the latest one to SKIPPED (so we ignore the window in this case). And when materializing more actions (at the next window), it should set any old WAITING ones to SKIPPED as well. This doesn't touch any actions materialized in the future. When getting READY actions in CoordActionReadyXCommand, only the latest READY action should become SUBMITTED; the rest should be SKIPPED (re-use LIFO query here) Just a note to clarify: even with LAST_ONLY it is possible for multiple jobs to be RUNNING at the same time. e.g. if an older action that used to be the latest is still RUNNING when new actions come along, then that older action and the latest of the newer actions could both be RUNNING.
          Hide
          Robert Kanter added a comment -

          The new patch attempts to do what I described in my previous comment. I believe it now properly handles actions in the past and in the present/future. I've also updated the documentation to explain LAST_ONLY a bit more.

          Show
          Robert Kanter added a comment - The new patch attempts to do what I described in my previous comment. I believe it now properly handles actions in the past and in the present/future. I've also updated the documentation to explain LAST_ONLY a bit more.
          Robert Kanter made changes -
          Attachment OOZIE-1319.patch [ 12642378 ]
          Robert Kanter made changes -
          Remote Link This issue links to "ReviewBoard (Web Link)" [ 14955 ]
          Hide
          Robert Kanter added a comment -

          I've also uploaded the patch to ReviewBoard: https://reviews.apache.org/r/20822/

          Show
          Robert Kanter added a comment - I've also uploaded the patch to ReviewBoard: https://reviews.apache.org/r/20822/
          Hide
          Hadoop QA added a comment -

          Testing JIRA OOZIE-1319

          Cleaning local git workspace

          ----------------------------

          +1 PATCH_APPLIES
          +1 CLEAN
          -1 RAW_PATCH_ANALYSIS
          . +1 the patch does not introduce any @author tags
          . +1 the patch does not introduce any tabs
          . +1 the patch does not introduce any trailing spaces
          . -1 the patch contains 3 line(s) longer than 132 characters
          . +1 the patch does adds/modifies 7 testcase(s)
          +1 RAT
          . +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
          . +1 the patch does not seem to introduce new Javadoc warnings
          -1 COMPILE
          . +1 HEAD compiles
          . +1 patch compiles
          . -1 the patch seems to introduce 1 new javac warning(s)
          +1 BACKWARDS_COMPATIBILITY
          . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
          . +1 the patch does not modify JPA files
          -1 TESTS
          . Tests run: 1445
          . Tests failed: 5
          . Tests errors: 2

          . The patch failed the following testcases:

          . testQueueUniquenessWithSameKey(org.apache.oozie.service.TestCallableQueueService)
          . testJobLog(org.apache.oozie.client.TestOozieCLI)
          . testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration)
          . testTimeOutWithUnresolvedMissingDependencies(org.apache.oozie.command.coord.TestCoordPushDependencyCheckXCommand)
          . testBundleId(org.apache.oozie.servlet.TestBulkMonitorWebServiceAPI)

          +1 DISTRO
          . +1 distro tarball builds with the patch

          ----------------------------
          -1 Overall result, please check the reported -1(s)

          The full output of the test-patch run is available at

          . https://builds.apache.org/job/oozie-trunk-precommit-build/1205/

          Show
          Hadoop QA added a comment - Testing JIRA OOZIE-1319 Cleaning local git workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN -1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . +1 the patch does not introduce any trailing spaces . -1 the patch contains 3 line(s) longer than 132 characters . +1 the patch does adds/modifies 7 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings -1 COMPILE . +1 HEAD compiles . +1 patch compiles . -1 the patch seems to introduce 1 new javac warning(s) +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS . Tests run: 1445 . Tests failed: 5 . Tests errors: 2 . The patch failed the following testcases: . testQueueUniquenessWithSameKey(org.apache.oozie.service.TestCallableQueueService) . testJobLog(org.apache.oozie.client.TestOozieCLI) . testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration) . testTimeOutWithUnresolvedMissingDependencies(org.apache.oozie.command.coord.TestCoordPushDependencyCheckXCommand) . testBundleId(org.apache.oozie.servlet.TestBulkMonitorWebServiceAPI) +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/1205/
          Hide
          Robert Kanter added a comment -

          Test failures appear to be unrelated

          Show
          Robert Kanter added a comment - Test failures appear to be unrelated
          Hide
          Bowen Zhang added a comment -

          two things here:

          • A job with frequency of 10 mins and starts from 10 am till 12 pm. Say, none of the data dependency is fulfilled. Your code will break if all of the data dependencies are fulfilled between 10:55 am and 11am. Since at 10:55 am, you SKIPPED all actions in the first hour and materialize all future actions. But by definition, if all data dependencies are available at 10:56 am, then, the action at 10:50 am should run.
          • This is a bigger issue with your change in CoordActionReadyXCommand. Say, a job with 10 mins frequency starts from 10 am to 11am. And all the data dependencies are realized at 10:35 am. So by definition, actions at 10:00, 10:10, and 10:20 will be SKIPPED. But, do you realize that CoordActionReadyXCommand is invoked per action in CoordActionInputCheckXCommand right after each action is set to READY? And the query "GET_COORD_ACTIONS_WAITING_SUBMITTED_OLDER_THAN" has no order preference. As a result, all your "for loop" in ReadyXCommand will only contain one action which you set to SUBMITTED. It's very likely at the end, all actions will be run.
          Show
          Bowen Zhang added a comment - two things here: A job with frequency of 10 mins and starts from 10 am till 12 pm. Say, none of the data dependency is fulfilled. Your code will break if all of the data dependencies are fulfilled between 10:55 am and 11am. Since at 10:55 am, you SKIPPED all actions in the first hour and materialize all future actions. But by definition, if all data dependencies are available at 10:56 am, then, the action at 10:50 am should run. This is a bigger issue with your change in CoordActionReadyXCommand. Say, a job with 10 mins frequency starts from 10 am to 11am. And all the data dependencies are realized at 10:35 am. So by definition, actions at 10:00, 10:10, and 10:20 will be SKIPPED. But, do you realize that CoordActionReadyXCommand is invoked per action in CoordActionInputCheckXCommand right after each action is set to READY? And the query "GET_COORD_ACTIONS_WAITING_SUBMITTED_OLDER_THAN" has no order preference. As a result, all your "for loop" in ReadyXCommand will only contain one action which you set to SUBMITTED. It's very likely at the end, all actions will be run.
          Hide
          Robert Kanter added a comment -

          You're right, those two cases will both be problematic. I'm not sure of a good way to solve them; any ideas?

          What if we made it LAST_ONLY a little more restrictive and have it effectively set the timeout equal to the frequency? I believe this would help prevent the second case because even if a bunch of data dependencies get satisfied at the same time, the older actions would have already TIMEDOUT, so you'd only have the current one.

          Show
          Robert Kanter added a comment - You're right, those two cases will both be problematic. I'm not sure of a good way to solve them; any ideas? What if we made it LAST_ONLY a little more restrictive and have it effectively set the timeout equal to the frequency? I believe this would help prevent the second case because even if a bunch of data dependencies get satisfied at the same time, the older actions would have already TIMEDOUT, so you'd only have the current one.
          Hide
          Mona Chitnis added a comment -

          A job with frequency of 10 mins and starts from 10 am till 12 pm. Say, none of the data dependency is fulfilled. Your code will break if all of the data dependencies are fulfilled between 10:55 am and 11am. Since at 10:55 am, you SKIPPED all actions in the first hour

          If materialization occurred at 10:55, why is 10:50 action SKIPPED? it would be the last eligible action in this case right?

          So we are going to have a case with multiple CoordActionReadyX commands queued up all looping over each other. To resolve this, you could have an action know the total SUBMITTED count so far and look only towards its predecessor.
          e.g.
          if action 10:50, n(SUBMITTED) = 0/1, own status - READY, action 10:40 - SUBMITTED,
          then action 10:40 -> SKIPPED, action 10:50 -> SUBMITTED

          but, if action 10:50, n(SUBMITTED) = 1, own status - READY, action 10:40 - RUNNING,
          then action 10:50 would have to mark itself as SKIPPED right?

          Show
          Mona Chitnis added a comment - A job with frequency of 10 mins and starts from 10 am till 12 pm. Say, none of the data dependency is fulfilled. Your code will break if all of the data dependencies are fulfilled between 10:55 am and 11am. Since at 10:55 am, you SKIPPED all actions in the first hour If materialization occurred at 10:55, why is 10:50 action SKIPPED? it would be the last eligible action in this case right? So we are going to have a case with multiple CoordActionReadyX commands queued up all looping over each other. To resolve this, you could have an action know the total SUBMITTED count so far and look only towards its predecessor. e.g. if action 10:50, n(SUBMITTED) = 0/1, own status - READY, action 10:40 - SUBMITTED, then action 10:40 -> SKIPPED, action 10:50 -> SUBMITTED but, if action 10:50, n(SUBMITTED) = 1, own status - READY, action 10:40 - RUNNING, then action 10:50 would have to mark itself as SKIPPED right?
          Hide
          Bowen Zhang added a comment -

          Mona Chitnis,

          If materialization occurred at 10:55, why is 10:50 action SKIPPED? it would be the last eligible action in this case right?

          , this is the expected behavior, but not what his code is going to do. His new function of handleLastOnly will mark all actions as "SKIPPED" during materialization event.

          e.g.
          if action 10:50, n(SUBMITTED) = 0/1, own status - READY, action 10:40 - SUBMITTED,
          then action 10:40 -> SKIPPED, action 10:50 -> SUBMITTED

          but, if action 10:50, n(SUBMITTED) = 1, own status - READY, action 10:40 - RUNNING,
          then action 10:50 would have to mark itself as SKIPPED right?

          These are all correct expected behavior. But the problem is in your first case, marking action 10:40 from SUBMITTED to SKIPPED can't work 100% of the time since at the mean time, RecoveryService may have already converted it into RUNNING.

          Show
          Bowen Zhang added a comment - Mona Chitnis , If materialization occurred at 10:55, why is 10:50 action SKIPPED? it would be the last eligible action in this case right? , this is the expected behavior, but not what his code is going to do. His new function of handleLastOnly will mark all actions as "SKIPPED" during materialization event. e.g. if action 10:50, n(SUBMITTED) = 0/1, own status - READY, action 10:40 - SUBMITTED, then action 10:40 -> SKIPPED, action 10:50 -> SUBMITTED but, if action 10:50, n(SUBMITTED) = 1, own status - READY, action 10:40 - RUNNING, then action 10:50 would have to mark itself as SKIPPED right? These are all correct expected behavior. But the problem is in your first case, marking action 10:40 from SUBMITTED to SKIPPED can't work 100% of the time since at the mean time, RecoveryService may have already converted it into RUNNING.
          Hide
          Robert Kanter added a comment -

          since at the mean time, RecoveryService may have already converted it into RUNNING.

          What if we simply had the RecoveryService ignore actions from LAST_ONLY jobs?

          Show
          Robert Kanter added a comment - since at the mean time, RecoveryService may have already converted it into RUNNING. What if we simply had the RecoveryService ignore actions from LAST_ONLY jobs?
          Hide
          Bowen Zhang added a comment -

          You can certainly do that. But, this is equivalent of having some blocking mechanism between RecoveryService and MaterializationService. And the "Ignore" should only apply when server is back up, not during regular recovery Runnable. Otherwise, job with "Last_only" never gets to run. So, we all need to just go back to my approach in oozie-1305.

          Show
          Bowen Zhang added a comment - You can certainly do that. But, this is equivalent of having some blocking mechanism between RecoveryService and MaterializationService. And the "Ignore" should only apply when server is back up, not during regular recovery Runnable. Otherwise, job with "Last_only" never gets to run. So, we all need to just go back to my approach in oozie-1305.
          Hide
          Robert Kanter added a comment -

          Looking at this again from another perspective: isn't setting LAST_ONLY effectively equivalent to doing FIFO with a timeout set to the amount of time between actions? For example, if you have actions materializing every 10min and also have a timeout of 10min, then when the time dependency for the next action is met, the previous action would become TIMEDOUT. This behavior already exists. So, perhaps we can re-use that logic so that when LAST_ONLY is set, the timeout for the current_action is set to MIN(timeout_set_by_the_user, next_action.nominal_time – current_action.nominal_time)? We then wouldn't need most of the other changes I've already made and we'd reuse the TIMEDOUT status instead of the new SKIPPED status.

          Thoughts?

          Show
          Robert Kanter added a comment - Looking at this again from another perspective: isn't setting LAST_ONLY effectively equivalent to doing FIFO with a timeout set to the amount of time between actions? For example, if you have actions materializing every 10min and also have a timeout of 10min, then when the time dependency for the next action is met, the previous action would become TIMEDOUT. This behavior already exists. So, perhaps we can re-use that logic so that when LAST_ONLY is set, the timeout for the current_action is set to MIN(timeout_set_by_the_user, next_action.nominal_time – current_action.nominal_time) ? We then wouldn't need most of the other changes I've already made and we'd reuse the TIMEDOUT status instead of the new SKIPPED status. Thoughts?
          Hide
          Robert Kanter added a comment -

          I first tried to re-use the TIMEDOUT status and logic, but that didn't work for a few reasons, including the fact that TIMEDOUT causes the coordinator job to transition to DONEWITHERROR and not SUCCEEDED.

          So I ended up mirroring the TIMEDOUT logic to some extent for the new SKIPPED status. This ended up being much simpler than the old approach. I kept some of the old approach's logic, such as when materializing actions in CoordMaterializeTransitionXCommand, if we're in the past and LAST_ONLY, then it needs to materialize all actions until "now". The new logic is in CoordActionInputCheckXCommand where we check if an action's dependencies are met (this also gets called for actions without dependencies); the TIMEDOUT logic is here too. I added some code that will set the action's status to SKIPPED if the current time is later than the next action's nominal time; this way, there is only ever one WAITING action whose time dependency has been met. The documentation in the patch gives a more user-oriented explanation that might be more clear.

          This is a little more restrictive than the old approach would have been had it worked properly. In affect, the new approach doesn't allow two WAITING actions whose time dependencies are met but data dependencies are not met could become SUBMITTED; while the old approach would allow this. However, I think this restriction make sense because with LAST_ONLY you're only interested in the last action anyway; older actions should be SKIPPED, even if the last action's (data) dependencies haven't been met yet.

          Show
          Robert Kanter added a comment - I first tried to re-use the TIMEDOUT status and logic, but that didn't work for a few reasons, including the fact that TIMEDOUT causes the coordinator job to transition to DONEWITHERROR and not SUCCEEDED. So I ended up mirroring the TIMEDOUT logic to some extent for the new SKIPPED status. This ended up being much simpler than the old approach. I kept some of the old approach's logic, such as when materializing actions in CoordMaterializeTransitionXCommand, if we're in the past and LAST_ONLY, then it needs to materialize all actions until "now". The new logic is in CoordActionInputCheckXCommand where we check if an action's dependencies are met (this also gets called for actions without dependencies); the TIMEDOUT logic is here too. I added some code that will set the action's status to SKIPPED if the current time is later than the next action's nominal time; this way, there is only ever one WAITING action whose time dependency has been met. The documentation in the patch gives a more user-oriented explanation that might be more clear. This is a little more restrictive than the old approach would have been had it worked properly. In affect, the new approach doesn't allow two WAITING actions whose time dependencies are met but data dependencies are not met could become SUBMITTED; while the old approach would allow this. However, I think this restriction make sense because with LAST_ONLY you're only interested in the last action anyway; older actions should be SKIPPED, even if the last action's (data) dependencies haven't been met yet.
          Robert Kanter made changes -
          Attachment OOZIE-1319.patch [ 12646642 ]
          Hide
          Hadoop QA added a comment -

          Testing JIRA OOZIE-1319

          Cleaning local git workspace

          ----------------------------

          +1 PATCH_APPLIES
          +1 CLEAN
          -1 RAW_PATCH_ANALYSIS
          . +1 the patch does not introduce any @author tags
          . +1 the patch does not introduce any tabs
          . +1 the patch does not introduce any trailing spaces
          . -1 the patch contains 5 line(s) longer than 132 characters
          . +1 the patch does adds/modifies 8 testcase(s)
          +1 RAT
          . +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
          . +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
          . +1 HEAD compiles
          . +1 patch compiles
          . +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
          . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
          . +1 the patch does not modify JPA files
          -1 TESTS
          . Tests run: 1458
          . Tests failed: 6
          . Tests errors: 4

          . The patch failed the following testcases:

          . testCoordStatusTransitServiceDoneWithError(org.apache.oozie.service.TestStatusTransitService)
          . testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration)
          . testBundleRerunInPausedWithError(org.apache.oozie.command.bundle.TestBundleRerunXCommand)
          . testCoordDefUnsupportedChange(org.apache.oozie.command.coord.TestCoordUpdateXCommand)
          . testTimeOutWithUnresolvedMissingDependencies(org.apache.oozie.command.coord.TestCoordPushDependencyCheckXCommand)
          . testEngine(org.apache.oozie.TestCoordinatorEngine)

          +1 DISTRO
          . +1 distro tarball builds with the patch

          ----------------------------
          -1 Overall result, please check the reported -1(s)

          The full output of the test-patch run is available at

          . https://builds.apache.org/job/oozie-trunk-precommit-build/1263/

          Show
          Hadoop QA added a comment - Testing JIRA OOZIE-1319 Cleaning local git workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN -1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . +1 the patch does not introduce any trailing spaces . -1 the patch contains 5 line(s) longer than 132 characters . +1 the patch does adds/modifies 8 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE . +1 HEAD compiles . +1 patch compiles . +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS . Tests run: 1458 . Tests failed: 6 . Tests errors: 4 . The patch failed the following testcases: . testCoordStatusTransitServiceDoneWithError(org.apache.oozie.service.TestStatusTransitService) . testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration) . testBundleRerunInPausedWithError(org.apache.oozie.command.bundle.TestBundleRerunXCommand) . testCoordDefUnsupportedChange(org.apache.oozie.command.coord.TestCoordUpdateXCommand) . testTimeOutWithUnresolvedMissingDependencies(org.apache.oozie.command.coord.TestCoordPushDependencyCheckXCommand) . testEngine(org.apache.oozie.TestCoordinatorEngine) +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/1263/
          Hide
          Bowen Zhang added a comment -

          I don't see you dealing with the race condition between RecoveryService and CoordMaterilalizeTriggerService.. When oozie server is back up after more than an hour of downtime, you may only skip the first hour of actions, but you still will materialize and run all actions with nominal times in the past one or several hours during downtime since CoordActionInputCheckXCommand runs before CoordMaterializationXCommand.
          Second, you CoordActionInputCheckXCommand relies on a big assumption that the latest action will get called first by CoordActionInputCheckXCommand since if the second to the latest one gets called first in CoordActionInputCheckXCommand, it will mark the latest action which is supposed to run as "SKIPPED".

          Show
          Bowen Zhang added a comment - I don't see you dealing with the race condition between RecoveryService and CoordMaterilalizeTriggerService.. When oozie server is back up after more than an hour of downtime, you may only skip the first hour of actions, but you still will materialize and run all actions with nominal times in the past one or several hours during downtime since CoordActionInputCheckXCommand runs before CoordMaterializationXCommand. Second, you CoordActionInputCheckXCommand relies on a big assumption that the latest action will get called first by CoordActionInputCheckXCommand since if the second to the latest one gets called first in CoordActionInputCheckXCommand, it will mark the latest action which is supposed to run as "SKIPPED".
          Hide
          Robert Kanter added a comment -

          When oozie server is back up after more than an hour of downtime, you may only skip the first hour of actions, but you still will materialize and run all actions with nominal times in the past one or several hours during downtime since CoordActionInputCheckXCommand runs before CoordMaterializationXCommand.

          By CoordMaterializationXCommand, did you mean CoordMaterializeTransitionXCommand? Doesn't CoordActionInputCheckXCommand get run periodically to check if an action's dependencies are met?

          Second, you CoordActionInputCheckXCommand relies on a big assumption that the latest action will get called first by CoordActionInputCheckXCommand since if the second to the latest one gets called first in CoordActionInputCheckXCommand, it will mark the latest action which is supposed to run as "SKIPPED".

          I don't think that's the case. CoordActionInputCheckXCommand looks at the next action, but it only edits the "current' action. That is, if it's checking action 5, it will look at action 6's nominal time, but it would set action 5's status only, not action 6's; that happens in a different call to CoordActionInputCheckXCommand. I don't think the order makes a difference here. Can you give an example of what you mean?

          The new approach I used should inherently work because it's doing something very similar to how the timeout stuff works, and that doesn't have any problems AFAIK.

          Show
          Robert Kanter added a comment - When oozie server is back up after more than an hour of downtime, you may only skip the first hour of actions, but you still will materialize and run all actions with nominal times in the past one or several hours during downtime since CoordActionInputCheckXCommand runs before CoordMaterializationXCommand. By CoordMaterializationXCommand, did you mean CoordMaterializeTransitionXCommand? Doesn't CoordActionInputCheckXCommand get run periodically to check if an action's dependencies are met? Second, you CoordActionInputCheckXCommand relies on a big assumption that the latest action will get called first by CoordActionInputCheckXCommand since if the second to the latest one gets called first in CoordActionInputCheckXCommand, it will mark the latest action which is supposed to run as "SKIPPED". I don't think that's the case. CoordActionInputCheckXCommand looks at the next action, but it only edits the "current' action. That is, if it's checking action 5, it will look at action 6's nominal time, but it would set action 5's status only, not action 6's; that happens in a different call to CoordActionInputCheckXCommand. I don't think the order makes a difference here. Can you give an example of what you mean? The new approach I used should inherently work because it's doing something very similar to how the timeout stuff works, and that doesn't have any problems AFAIK.
          Hide
          Bowen Zhang added a comment -

          you are right on these two points. The logic LGTM. Can you change the check for "LAST_ONLY" execution order to "ignoreCase"?
          +1.

          Show
          Bowen Zhang added a comment - you are right on these two points. The logic LGTM. Can you change the check for "LAST_ONLY" execution order to "ignoreCase"? +1.
          Hide
          Bowen Zhang added a comment -

          As a follow-on, I want to merge 1305 into this feature once you check this in. I am just adding a new order called "NONE". so, we don't need to change any DB.

          Show
          Bowen Zhang added a comment - As a follow-on, I want to merge 1305 into this feature once you check this in. I am just adding a new order called "NONE". so, we don't need to change any DB.
          Hide
          Robert Kanter added a comment -

          Can you change the check for "LAST_ONLY" execution order to "ignoreCase"?

          When checking LAST_ONLY I use the enum, not the String, so that shouldn't be a problem. Unless I did one of the checks the wrong way?

          I don't think we should merge OOZIE-1305 into this JIRA; this patch is big and complicated enough. But it definitely makes sense to re-use the SKIPPED status I added for the NONE execution order. Agree?

          Show
          Robert Kanter added a comment - Can you change the check for "LAST_ONLY" execution order to "ignoreCase"? When checking LAST_ONLY I use the enum, not the String, so that shouldn't be a problem. Unless I did one of the checks the wrong way? I don't think we should merge OOZIE-1305 into this JIRA; this patch is big and complicated enough. But it definitely makes sense to re-use the SKIPPED status I added for the NONE execution order. Agree?
          Hide
          Bowen Zhang added a comment -

          I didn't say commit 1305 with this one. I ssaid

          I want to merge 1305 into this feature once you check this in

          .

          Show
          Bowen Zhang added a comment - I didn't say commit 1305 with this one. I ssaid I want to merge 1305 into this feature once you check this in .
          Hide
          Robert Kanter added a comment -

          Ah, sorry, I misunderstood what you meant.

          I'll check that the test failures are unrelated and commit this soon. Thanks for the review

          Show
          Robert Kanter added a comment - Ah, sorry, I misunderstood what you meant. I'll check that the test failures are unrelated and commit this soon. Thanks for the review
          Hide
          Robert Kanter added a comment -

          TestCoordinatorEngine failure was related: it checks an error message which lists all of the action statuses and didn't include SKIPPED so it was failing. The new patch fixes that.

          I'll commit it once Jenkins confirms that it's now fixed.

          Show
          Robert Kanter added a comment - TestCoordinatorEngine failure was related: it checks an error message which lists all of the action statuses and didn't include SKIPPED so it was failing. The new patch fixes that. I'll commit it once Jenkins confirms that it's now fixed.
          Robert Kanter made changes -
          Attachment OOZIE-1319.patch [ 12647236 ]
          Hide
          Hadoop QA added a comment -

          Testing JIRA OOZIE-1319

          Cleaning local git workspace

          ----------------------------

          +1 PATCH_APPLIES
          +1 CLEAN
          -1 RAW_PATCH_ANALYSIS
          . +1 the patch does not introduce any @author tags
          . +1 the patch does not introduce any tabs
          . +1 the patch does not introduce any trailing spaces
          . -1 the patch contains 5 line(s) longer than 132 characters
          . +1 the patch does adds/modifies 9 testcase(s)
          +1 RAT
          . +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
          . +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
          . +1 HEAD compiles
          . +1 patch compiles
          . +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
          . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
          . +1 the patch does not modify JPA files
          -1 TESTS
          . Tests run: 1460
          . Tests failed: 4
          . Tests errors: 6

          . The patch failed the following testcases:

          . testActionInputCheckLatestActionCreationTime(org.apache.oozie.command.coord.TestCoordActionInputCheckXCommand)
          . testBundleEngineSuspend(org.apache.oozie.servlet.TestV1JobServletBundleEngine)
          . testLastOnlyMaterialization(org.apache.oozie.command.coord.TestCoordMaterializeTransitionXCommand)
          . testActionInputCheckLatestActionCreationTime(org.apache.oozie.command.coord.TestCoordActionInputCheckXCommandNonUTC)

          +1 DISTRO
          . +1 distro tarball builds with the patch

          ----------------------------
          -1 Overall result, please check the reported -1(s)

          The full output of the test-patch run is available at

          . https://builds.apache.org/job/oozie-trunk-precommit-build/1270/

          Show
          Hadoop QA added a comment - Testing JIRA OOZIE-1319 Cleaning local git workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN -1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . +1 the patch does not introduce any trailing spaces . -1 the patch contains 5 line(s) longer than 132 characters . +1 the patch does adds/modifies 9 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE . +1 HEAD compiles . +1 patch compiles . +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS . Tests run: 1460 . Tests failed: 4 . Tests errors: 6 . The patch failed the following testcases: . testActionInputCheckLatestActionCreationTime(org.apache.oozie.command.coord.TestCoordActionInputCheckXCommand) . testBundleEngineSuspend(org.apache.oozie.servlet.TestV1JobServletBundleEngine) . testLastOnlyMaterialization(org.apache.oozie.command.coord.TestCoordMaterializeTransitionXCommand) . testActionInputCheckLatestActionCreationTime(org.apache.oozie.command.coord.TestCoordActionInputCheckXCommandNonUTC) +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/1270/
          Hide
          Bowen Zhang added a comment -

          Robert Kanter, I think my claim about problem arising from server backup is still legit. Consider the server goes down at 8:30 am and back up at 9:31am. The job has a frequency of 10 mins and starts at 8 am. what is expected is only the action at 9:30 am should run and the rest should be skipped. But in your code, the action at 8:50am would also run since CoordInputcheckXCommand sees the nextcoordaction is null before the materialization service materializes actions from 9 am till 10 am. As a result, the action at 8:50 am will be marked as READY.

          Show
          Bowen Zhang added a comment - Robert Kanter , I think my claim about problem arising from server backup is still legit. Consider the server goes down at 8:30 am and back up at 9:31am. The job has a frequency of 10 mins and starts at 8 am. what is expected is only the action at 9:30 am should run and the rest should be skipped. But in your code, the action at 8:50am would also run since CoordInputcheckXCommand sees the nextcoordaction is null before the materialization service materializes actions from 9 am till 10 am. As a result, the action at 8:50 am will be marked as READY.
          Hide
          Robert Kanter added a comment -

          I tried replicating that scenario (by playing with my system clock so I didn't have to wait so long) and it worked correctly; that doesn't seem to be a problem.

          Though I do have some tests to fix; it looks like OOZIE-1819, which got in when I last rebased the patch, somehow broke some of my tests.

          Show
          Robert Kanter added a comment - I tried replicating that scenario (by playing with my system clock so I didn't have to wait so long) and it worked correctly; that doesn't seem to be a problem. Though I do have some tests to fix; it looks like OOZIE-1819 , which got in when I last rebased the patch, somehow broke some of my tests.
          Hide
          Bowen Zhang added a comment -

          But analytically, this does seem to be a problem.

          Show
          Bowen Zhang added a comment - But analytically, this does seem to be a problem.
          Hide
          Robert Kanter added a comment -

          That certainly seems like a valid concern, but it looks like it materializes the actions before the check happens so it's not actually a problem.

          Show
          Robert Kanter added a comment - That certainly seems like a valid concern, but it looks like it materializes the actions before the check happens so it's not actually a problem.
          Hide
          Bowen Zhang added a comment -

          I think the reason you don't see the problem is that you only have one coordinator or few coordinators running. Therefore, Materialization happens really fast which beats the speed of recoveryservice. You can check this patch in, but certainly this is a bug.

          Show
          Bowen Zhang added a comment - I think the reason you don't see the problem is that you only have one coordinator or few coordinators running. Therefore, Materialization happens really fast which beats the speed of recoveryservice. You can check this patch in, but certainly this is a bug.
          Hide
          Robert Kanter added a comment -

          Ah, that's a good point. I'll try to think of a solution to that problem.

          Show
          Robert Kanter added a comment - Ah, that's a good point. I'll try to think of a solution to that problem.
          Hide
          Bowen Zhang added a comment -

          Try to think if you can have any blocking mechanism between the two services. For examples, skip queuing CoordInputCheckXCommand for certain actions with LAST_ONLY execution order. And queue them only after certain conditions are satisfied.

          Show
          Bowen Zhang added a comment - Try to think if you can have any blocking mechanism between the two services. For examples, skip queuing CoordInputCheckXCommand for certain actions with LAST_ONLY execution order. And queue them only after certain conditions are satisfied.
          Hide
          Robert Kanter added a comment -

          That's going to get tricky. I have another idea to try first: Instead of looking at the nominal time of the next action in CoordActionInputCheckXCommand, I can have it calculate what the nominal time of the next action should be. Functionally, this is equivalent to what I have now, but the advantage is that it doesn't rely on the next action having been materialized yet. So in the last scenario that you gave, even though the action at 8:50am might see the nextcoordaction as null, it won't matter because it will compute what the nextcoordaction's nominal time is. I should be able to steal some logic from the CoordMaterializeTransitionXCommand for computing the next nominal time.

          Show
          Robert Kanter added a comment - That's going to get tricky. I have another idea to try first: Instead of looking at the nominal time of the next action in CoordActionInputCheckXCommand, I can have it calculate what the nominal time of the next action should be. Functionally, this is equivalent to what I have now, but the advantage is that it doesn't rely on the next action having been materialized yet. So in the last scenario that you gave, even though the action at 8:50am might see the nextcoordaction as null, it won't matter because it will compute what the nextcoordaction's nominal time is. I should be able to steal some logic from the CoordMaterializeTransitionXCommand for computing the next nominal time.
          Hide
          Robert Kanter added a comment -

          To clarify, if I can get that idea to work, then I don't need to even retrieve the nextcoordaction at all and can delete that code.

          Show
          Robert Kanter added a comment - To clarify, if I can get that idea to work, then I don't need to even retrieve the nextcoordaction at all and can delete that code.
          Hide
          Bowen Zhang added a comment -

          Sounds promising. please consider both integer and cron frequency.

          Show
          Bowen Zhang added a comment - Sounds promising. please consider both integer and cron frequency.
          Hide
          Robert Kanter added a comment -

          Yup; cron is actually easier

          The new patch gets rid of the changes added to get the next action in CoordActionInputCheckXCommand and instead computes the next nominal time. I also fixed/updated the related failing tests.

          Show
          Robert Kanter added a comment - Yup; cron is actually easier The new patch gets rid of the changes added to get the next action in CoordActionInputCheckXCommand and instead computes the next nominal time. I also fixed/updated the related failing tests.
          Robert Kanter made changes -
          Attachment OOZIE-1319.patch [ 12647515 ]
          Hide
          Hadoop QA added a comment -

          Testing JIRA OOZIE-1319

          Cleaning local git workspace

          ----------------------------

          +1 PATCH_APPLIES
          +1 CLEAN
          -1 RAW_PATCH_ANALYSIS
          . +1 the patch does not introduce any @author tags
          . +1 the patch does not introduce any tabs
          . +1 the patch does not introduce any trailing spaces
          . -1 the patch contains 2 line(s) longer than 132 characters
          . +1 the patch does adds/modifies 7 testcase(s)
          +1 RAT
          . +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
          . +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
          . +1 HEAD compiles
          . +1 patch compiles
          . +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
          . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
          . +1 the patch does not modify JPA files
          -1 TESTS
          . Tests run: 1458
          . Tests failed: 1
          . Tests errors: 1

          . The patch failed the following testcases:

          . testBundleEngineKill(org.apache.oozie.servlet.TestV1JobServletBundleEngine)

          +1 DISTRO
          . +1 distro tarball builds with the patch

          ----------------------------
          -1 Overall result, please check the reported -1(s)

          The full output of the test-patch run is available at

          . https://builds.apache.org/job/oozie-trunk-precommit-build/1275/

          Show
          Hadoop QA added a comment - Testing JIRA OOZIE-1319 Cleaning local git workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN -1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . +1 the patch does not introduce any trailing spaces . -1 the patch contains 2 line(s) longer than 132 characters . +1 the patch does adds/modifies 7 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE . +1 HEAD compiles . +1 patch compiles . +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS . Tests run: 1458 . Tests failed: 1 . Tests errors: 1 . The patch failed the following testcases: . testBundleEngineKill(org.apache.oozie.servlet.TestV1JobServletBundleEngine) +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/1275/
          Hide
          Robert Kanter added a comment -

          Test failures are unrelated. Lines that are too long are Named Queries.

          Show
          Robert Kanter added a comment - Test failures are unrelated. Lines that are too long are Named Queries.
          Hide
          Bowen Zhang added a comment -

          one last issue. When computing nextNominalTime, you need to make sure nextNominalTime is < end time of the coord job. Otherwise, you may wrongly skip the last action of the coord job. After that, +1.

          Show
          Bowen Zhang added a comment - one last issue. When computing nextNominalTime, you need to make sure nextNominalTime is < end time of the coord job. Otherwise, you may wrongly skip the last action of the coord job. After that, +1.
          Hide
          Robert Kanter added a comment -

          It already checks that:

          // If the next nominal time is after the job's end time, then this is the last action, so return null
          if (nextNominalTime.after(coordJob.getEndTime())) {
              nextNominalTime = null;
          }
          return nextNominalTime;
          
          Show
          Robert Kanter added a comment - It already checks that: // If the next nominal time is after the job's end time, then this is the last action, so return null if (nextNominalTime.after(coordJob.getEndTime())) { nextNominalTime = null ; } return nextNominalTime;
          Hide
          Robert Kanter added a comment -

          Thanks for the reviews and catching all of those edge cases. I'm glad we can finally get this in

          Show
          Robert Kanter added a comment - Thanks for the reviews and catching all of those edge cases. I'm glad we can finally get this in
          Hide
          Bowen Zhang added a comment -

          my bad, didn't see it.

          Show
          Bowen Zhang added a comment - my bad, didn't see it.
          Hide
          Robert Kanter added a comment -

          Committed to trunk!

          Show
          Robert Kanter added a comment - Committed to trunk!
          Robert Kanter made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s trunk [ 12323240 ]
          Resolution Fixed [ 1 ]
          Hide
          Shwetha G S added a comment -

          I have a question regarding this implementation(Sorry about the late ask) - Why does this require changes in CoordActionInputCheckXCommand, why the comparison with now and next nominal time? Take an example of an hourly coord and there are coord actions for 8th, 9th, 10th and 11th hour in WAITING state. The expectation of LAST_ONLY is that among the ready actions(all dependencies met), it runs only the last one. But with now and next nominal time check, CoordActionInputCheckXCommand marks 8th, 9th and 10th as SKIPPED and 11th will remain in WAITING state. If the dependencies are available late consistently, the same thing continues(in 12th hour, 11th will be marked SKIPPED) and so on and none of the instances for this coord will run.

          Instead, shouldn't the change be in CoordActionReadyXCommand which picks all ready actions and picks latest and discard the rest?

          Show
          Shwetha G S added a comment - I have a question regarding this implementation(Sorry about the late ask) - Why does this require changes in CoordActionInputCheckXCommand, why the comparison with now and next nominal time? Take an example of an hourly coord and there are coord actions for 8th, 9th, 10th and 11th hour in WAITING state. The expectation of LAST_ONLY is that among the ready actions(all dependencies met), it runs only the last one. But with now and next nominal time check, CoordActionInputCheckXCommand marks 8th, 9th and 10th as SKIPPED and 11th will remain in WAITING state. If the dependencies are available late consistently, the same thing continues(in 12th hour, 11th will be marked SKIPPED) and so on and none of the instances for this coord will run. Instead, shouldn't the change be in CoordActionReadyXCommand which picks all ready actions and picks latest and discard the rest?
          Hide
          Robert Kanter added a comment -

          CoordActionInputCheckXCommand, why the comparison with now and next nominal time?

          As for what the LAST_ONLY option was supposed to do, we never actually had any very specific documentation. To try to keep this from becoming very complicated (both to implement and to use), at some point I made the decision that the LAST_ONLY will look at the nominal time of the next nominal time. In essence, you can think of this like having the timeout set to the interval, but instead of being TIMEDOUT (which results in a "bad" end state for the job), it becomes SKIPPED (which results in a "good" end state for the job). I think this decision makes sense anyway because if the next action is ready to run, the previous action should have either already ran or should not run.

          Take an example of an hourly coord and there are coord actions for 8th, 9th, 10th and 11th hour in WAITING state. The expectation of LAST_ONLY is that among the ready actions(all dependencies met), it runs only the last one. But with now and next nominal time check, CoordActionInputCheckXCommand marks 8th, 9th and 10th as SKIPPED and 11th will remain in WAITING state. If the dependencies are available late consistently, the same thing continues(in 12th hour, 11th will be marked SKIPPED) and so on and none of the instances for this coord will run.

          If the dependencies are met, then shouldn't the 11th transition from WAITING to READY to RUNNING etc?
          But yes, I suppose if you had a job where the data was consistently late by over one interval of the frequency, and the timeout was set high enough, then it is possible that no actions will actually run. I'm not sure of a good way to handle this scenario, but if you have any ideas, please file a JIRA.

          Show
          Robert Kanter added a comment - CoordActionInputCheckXCommand, why the comparison with now and next nominal time? As for what the LAST_ONLY option was supposed to do, we never actually had any very specific documentation. To try to keep this from becoming very complicated (both to implement and to use), at some point I made the decision that the LAST_ONLY will look at the nominal time of the next nominal time. In essence, you can think of this like having the timeout set to the interval, but instead of being TIMEDOUT (which results in a "bad" end state for the job), it becomes SKIPPED (which results in a "good" end state for the job). I think this decision makes sense anyway because if the next action is ready to run, the previous action should have either already ran or should not run. Take an example of an hourly coord and there are coord actions for 8th, 9th, 10th and 11th hour in WAITING state. The expectation of LAST_ONLY is that among the ready actions(all dependencies met), it runs only the last one. But with now and next nominal time check, CoordActionInputCheckXCommand marks 8th, 9th and 10th as SKIPPED and 11th will remain in WAITING state. If the dependencies are available late consistently, the same thing continues(in 12th hour, 11th will be marked SKIPPED) and so on and none of the instances for this coord will run. If the dependencies are met, then shouldn't the 11th transition from WAITING to READY to RUNNING etc? But yes, I suppose if you had a job where the data was consistently late by over one interval of the frequency, and the timeout was set high enough, then it is possible that no actions will actually run. I'm not sure of a good way to handle this scenario, but if you have any ideas, please file a JIRA.
          Hide
          Shwetha G S added a comment -

          Previously, LAST_ONLY was running the latest ready action. The gap was that it was picking up even the old ready instances after that. So, I thought just skipping the old instances in CoordActionReadyXCommand would complete the feature. But the current implementation changes the meaning of LAST_ONLY to run only the latest nominal time(makes sense in case of backlog, but the backlog is in waiting actions which will happen probably because of materialisation delays which is internal to oozie?)

          Anyways, now that its implemented already, Is there a usecase for the current implementation of LAST_ONLY. If there is, I will add another enum which executes latest ready action and discards the rest. If there is no usecase, can we modify LAST_ONLY to run the latest ready action?

          Depending on what we decide, I will create the jira accordingly. Let me know. Thanks

          Show
          Shwetha G S added a comment - Previously, LAST_ONLY was running the latest ready action. The gap was that it was picking up even the old ready instances after that. So, I thought just skipping the old instances in CoordActionReadyXCommand would complete the feature. But the current implementation changes the meaning of LAST_ONLY to run only the latest nominal time(makes sense in case of backlog, but the backlog is in waiting actions which will happen probably because of materialisation delays which is internal to oozie?) Anyways, now that its implemented already, Is there a usecase for the current implementation of LAST_ONLY. If there is, I will add another enum which executes latest ready action and discards the rest. If there is no usecase, can we modify LAST_ONLY to run the latest ready action? Depending on what we decide, I will create the jira accordingly. Let me know. Thanks
          Rohini Palaniswamy made changes -
          Fix Version/s 4.1.0 [ 12324960 ]
          Fix Version/s trunk [ 12323240 ]
          Hide
          Shwetha G S added a comment -

          Robert Kanter, can you check this please?

          Show
          Shwetha G S added a comment - Robert Kanter , can you check this please?
          Hide
          Robert Kanter added a comment -

          Previously, LAST_ONLY was running the latest ready action. The gap was that it was picking up even the old ready instances after that.

          Actually, previously LAST_ONLY was basically the same as LIFO. And from what I could tell, it hasn't worked properly for at least 3 years.

          But the current implementation changes the meaning of LAST_ONLY to run only the latest nominal time

          The meaning of LAST_ONLY was never clearly specified. From the documentation, it only had this: "discards all older materializations". That's pretty vague and could mean a few different things. After Bowen's attempt at this JIRA, and then my earlier attempts, it was getting too complicated and we kept find edge cases that we couldn't fix. t finally decided that LAST_ONLY could be interpreted as essentially the same as having a timeout equal to the frequency, but with a "good" ending status (see this comment and later).

          Is there a usecase for the current implementation of LAST_ONLY. If there is

          LAST_ONLY is good if you have a coordinator where you don't necessarily care about the specific instances that it runs, just that it runs. Specifically, we had a customer who had a coordinator with LAST_ONLY and was expecting this behavior. They brought down their Oozie server for a while for maintenance; when they brought it up, they were expecting Oozie to ignore the "missed" instantiations and just do the last one and continue normally after that, but it instead played "catch up" and started them all. That's why I wanted to fix this; otherwise, nobody has complained that LAST_ONLY hasn't worked in 3 years... I imagine most users don't even know that you can change it to LIFO or LAST_ONLY.

          I will add another enum which executes latest ready action and discards the rest.

          Feel free to add another execution strategy. Bowen Zhang has already created a new one called NONE in OOZIE-1875.

          Show
          Robert Kanter added a comment - Previously, LAST_ONLY was running the latest ready action. The gap was that it was picking up even the old ready instances after that. Actually, previously LAST_ONLY was basically the same as LIFO. And from what I could tell, it hasn't worked properly for at least 3 years. But the current implementation changes the meaning of LAST_ONLY to run only the latest nominal time The meaning of LAST_ONLY was never clearly specified. From the documentation, it only had this: "discards all older materializations". That's pretty vague and could mean a few different things. After Bowen's attempt at this JIRA, and then my earlier attempts, it was getting too complicated and we kept find edge cases that we couldn't fix. t finally decided that LAST_ONLY could be interpreted as essentially the same as having a timeout equal to the frequency, but with a "good" ending status (see this comment and later). Is there a usecase for the current implementation of LAST_ONLY. If there is LAST_ONLY is good if you have a coordinator where you don't necessarily care about the specific instances that it runs, just that it runs. Specifically, we had a customer who had a coordinator with LAST_ONLY and was expecting this behavior. They brought down their Oozie server for a while for maintenance; when they brought it up, they were expecting Oozie to ignore the "missed" instantiations and just do the last one and continue normally after that, but it instead played "catch up" and started them all. That's why I wanted to fix this; otherwise, nobody has complained that LAST_ONLY hasn't worked in 3 years... I imagine most users don't even know that you can change it to LIFO or LAST_ONLY. I will add another enum which executes latest ready action and discards the rest. Feel free to add another execution strategy. Bowen Zhang has already created a new one called NONE in OOZIE-1875 .
          Hide
          Shwetha G S added a comment -

          LAST_ONLY makes sense for coords with no inputs. Created OOZIE-1951 for running latest ready action. Thanks

          Show
          Shwetha G S added a comment - LAST_ONLY makes sense for coords with no inputs. Created OOZIE-1951 for running latest ready action. Thanks
          Bowen Zhang made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hide
          zhang added a comment -

          Bowen Zhang in 4.1.0 version,Only when i restart the server it will skipped the WAITING or READY job. And why i can't find the handleLastOnly method in 4.1.0-branche source code. Thanks

          Show
          zhang added a comment - Bowen Zhang in 4.1.0 version,Only when i restart the server it will skipped the WAITING or READY job. And why i can't find the handleLastOnly method in 4.1.0-branche source code. Thanks
          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Patch Available Patch Available
          47d 8h 19m 1 Bowen Zhang 29/May/13 02:05
          Patch Available Patch Available Resolved Resolved
          366d 20h 34m 1 Robert Kanter 30/May/14 22:39
          Resolved Resolved Closed Closed
          192d 3h 5m 1 Bowen Zhang 09/Dec/14 00:44

            People

            • Assignee:
              Robert Kanter
              Reporter:
              Bowen Zhang
            • Votes:
              1 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development