Uploaded image for project: 'Oozie'
  1. Oozie
  2. OOZIE-1978

Forkjoin validation code is ridiculously slow in some cases

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: trunk, 4.0.1
    • Fix Version/s: 4.3.0
    • Component/s: core
    • Labels:
      None

      Description

      We've had a few users who have run into problems where submitting a workflow appears to hang (in the case of a subworkflow, it's similar but stuck in PREP). It turns out that if you wait long enough, it will actually go through and the workflow will run normally. The problem is that the forkjoin validation code is taking a really long time.

      The attached example has a series of 20 forks where each fork has 6 actions (it's based on an actual workflow, but all of the names were changed and the actions were all replaced by simple shell actions). One of our support guys said it took 1-2 hours , but on my computer it was taking 15+ hours (I had to cancel it)
      While this example doesn't have any nested forks, those can also take a long time too.

      It's easy to verify that it's the forkjoin validation code that's taking so long by looking at a jstack of the Oozie server and seeing deep recursive calls to org.apache.oozie.workflow.lite.LiteWorkflowAppParser.validateForkJoin. I also noticed a lot of sitting around in calls LinkedList.contains.

      I think we have 3 options:

      1. See if we can make the existing code faster somehow. Perhaps there's a way to parallelize it? Maybe there's some redundant checking that we can identify and skip? Change some data structures? etc
      2. See if we can write a new way to do this validation. I had originally completely rewritten this code a while ago, and we've since made a few fixes to catch edge cases and things. Perhaps it needs another rewrite?
      3. Try to identify when it's taking a long time and at least let the user know what's happening or something. Right now, it just appears that the Oozie CLI has hung and the job doesn't show up in the Oozie server. Most users aren't going to wait more than a minute or two.
      1. workflow.xml
        37 kB
        Robert Kanter
      2. OOZIE-1978_wip.001.patch
        24 kB
        Robert Kanter
      3. OOZIE-1978-001.patch
        33 kB
        Peter Bacsko
      4. OOZIE-1978-002.patch
        45 kB
        Peter Bacsko
      5. OOZIE-1978-003.patch
        45 kB
        Peter Bacsko
      6. OOZIE-1978-004.patch
        45 kB
        Peter Bacsko
      7. OOZIE-1978-002.patch
        45 kB
        Peter Bacsko
      8. OOZIE-1978-005.patch
        90 kB
        Peter Bacsko
      9. OOZIE-1978-006.patch
        89 kB
        Peter Bacsko

        Issue Links

          Activity

          Hide
          rkanter Robert Kanter added a comment -

          Closing issue; Oozie 4.3.0 is released.

          Show
          rkanter Robert Kanter added a comment - Closing issue; Oozie 4.3.0 is released.
          Hide
          rkanter Robert Kanter added a comment -

          Thanks Peter Bacsko and everyone for reviews. Committed to master!

          I'm glad we finally got this in; it's been bothering me for 2 years!

          Show
          rkanter Robert Kanter added a comment - Thanks Peter Bacsko and everyone for reviews. Committed to master! I'm glad we finally got this in; it's been bothering me for 2 years!
          Hide
          rkanter Robert Kanter added a comment -

          I'll commit this tomorrow afternoon PST if nobody has any other comments.

          Show
          rkanter Robert Kanter added a comment - I'll commit this tomorrow afternoon PST if nobody has any other comments.
          Hide
          jaydeepvishwakarma Jaydeep Vishwakarma added a comment -

          +1, LGTM

          Show
          jaydeepvishwakarma Jaydeep Vishwakarma added a comment - +1, LGTM
          Hide
          abhishekbafna Abhishek Bafna added a comment -

          +1 non-binding

          Show
          abhishekbafna Abhishek Bafna added a comment - +1 non-binding
          Hide
          rkanter Robert Kanter added a comment -

          +1 LGTM

          Though given the complexity of this patch, it would be good to have at least one other person review it (e.g. Purshotam Shah, Rohini Palaniswamy, Satish Subhashrao Saley, Jaydeep Vishwakarma, Abhishek Bafna).

          Show
          rkanter Robert Kanter added a comment - +1 LGTM Though given the complexity of this patch, it would be good to have at least one other person review it (e.g. Purshotam Shah , Rohini Palaniswamy , Satish Subhashrao Saley , Jaydeep Vishwakarma , Abhishek Bafna ).
          Hide
          hadoopqa Hadoop QA added a comment -

          Testing JIRA OOZIE-1978

          Cleaning local git workspace

          ----------------------------

          +1 PATCH_APPLIES
          +1 CLEAN
          +1 RAW_PATCH_ANALYSIS
          . +1 the patch does not introduce any @author tags
          . +1 the patch does not introduce any tabs
          . +1 the patch does not introduce any trailing spaces
          . +1 the patch does not introduce any line longer than 132
          . +1 the patch does adds/modifies 3 testcase(s)
          +1 RAT
          . +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
          . +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
          . +1 HEAD compiles
          . +1 patch compiles
          . +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
          . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
          . +1 the patch does not modify JPA files
          -1 TESTS
          . Tests run: 1800
          . Tests failed: 1
          . Tests errors: 0

          . The patch failed the following testcases:

          . testCoordStatus_Killed(org.apache.oozie.command.coord.TestCoordChangeXCommand)

          +1 DISTRO
          . +1 distro tarball builds with the patch

          ----------------------------
          -1 Overall result, please check the reported -1(s)

          The full output of the test-patch run is available at

          . https://builds.apache.org/job/oozie-trunk-precommit-build/3258/

          Show
          hadoopqa Hadoop QA added a comment - Testing JIRA OOZIE-1978 Cleaning local git workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN +1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . +1 the patch does not introduce any trailing spaces . +1 the patch does not introduce any line longer than 132 . +1 the patch does adds/modifies 3 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE . +1 HEAD compiles . +1 patch compiles . +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS . Tests run: 1800 . Tests failed: 1 . Tests errors: 0 . The patch failed the following testcases: . testCoordStatus_Killed(org.apache.oozie.command.coord.TestCoordChangeXCommand) +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/3258/
          Hide
          hadoopqa Hadoop QA added a comment -

          Testing JIRA OOZIE-1978

          Cleaning local git workspace

          ----------------------------

          +1 PATCH_APPLIES
          +1 CLEAN
          +1 RAW_PATCH_ANALYSIS
          . +1 the patch does not introduce any @author tags
          . +1 the patch does not introduce any tabs
          . +1 the patch does not introduce any trailing spaces
          . +1 the patch does not introduce any line longer than 132
          . +1 the patch does adds/modifies 4 testcase(s)
          +1 RAT
          . +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
          . +1 the patch does not seem to introduce new Javadoc warnings
          -1 COMPILE
          . -1 HEAD does not compile
          . -1 patch does not compile
          . +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
          . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
          . +1 the patch does not modify JPA files
          -1 TESTS - patch does not compile, cannot run testcases
          +1 DISTRO
          . +1 distro tarball builds with the patch

          ----------------------------
          -1 Overall result, please check the reported -1(s)

          The full output of the test-patch run is available at

          . https://builds.apache.org/job/oozie-trunk-precommit-build/3081/

          Show
          hadoopqa Hadoop QA added a comment - Testing JIRA OOZIE-1978 Cleaning local git workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN +1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . +1 the patch does not introduce any trailing spaces . +1 the patch does not introduce any line longer than 132 . +1 the patch does adds/modifies 4 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings -1 COMPILE . -1 HEAD does not compile . -1 patch does not compile . +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS - patch does not compile, cannot run testcases +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/3081/
          Hide
          hadoopqa Hadoop QA added a comment -

          Testing JIRA OOZIE-1978

          Cleaning local git workspace

          ----------------------------

          +1 PATCH_APPLIES
          +1 CLEAN
          +1 RAW_PATCH_ANALYSIS
          . +1 the patch does not introduce any @author tags
          . +1 the patch does not introduce any tabs
          . +1 the patch does not introduce any trailing spaces
          . +1 the patch does not introduce any line longer than 132
          . +1 the patch does adds/modifies 4 testcase(s)
          +1 RAT
          . +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
          . +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
          . +1 HEAD compiles
          . +1 patch compiles
          . +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
          . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
          . +1 the patch does not modify JPA files
          -1 TESTS
          . Tests run: 1790
          . Tests failed: 8
          . Tests errors: 0

          . The patch failed the following testcases:

          . testBundleStatusNotTransitionFromKilled(org.apache.oozie.service.TestStatusTransitService)
          . testBundleStatusTransitWithLock(org.apache.oozie.service.TestStatusTransitService)
          . testCoordStatusTransitServiceForTimeout(org.apache.oozie.service.TestStatusTransitService)
          . testNone(org.apache.oozie.command.coord.TestCoordActionInputCheckXCommandNonUTC)
          . testActionKillCommandActionNumbers(org.apache.oozie.command.coord.TestCoordActionsKillXCommand)
          . testCoordMaterializeTriggerService3(org.apache.oozie.service.TestCoordMaterializeTriggerService)
          . testMaxMatThrottleNotPicked(org.apache.oozie.service.TestCoordMaterializeTriggerService)
          . testCoordStatus_Killed(org.apache.oozie.command.coord.TestCoordChangeXCommand)

          +1 DISTRO
          . +1 distro tarball builds with the patch

          ----------------------------
          -1 Overall result, please check the reported -1(s)

          The full output of the test-patch run is available at

          . https://builds.apache.org/job/oozie-trunk-precommit-build/3048/

          Show
          hadoopqa Hadoop QA added a comment - Testing JIRA OOZIE-1978 Cleaning local git workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN +1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . +1 the patch does not introduce any trailing spaces . +1 the patch does not introduce any line longer than 132 . +1 the patch does adds/modifies 4 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE . +1 HEAD compiles . +1 patch compiles . +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS . Tests run: 1790 . Tests failed: 8 . Tests errors: 0 . The patch failed the following testcases: . testBundleStatusNotTransitionFromKilled(org.apache.oozie.service.TestStatusTransitService) . testBundleStatusTransitWithLock(org.apache.oozie.service.TestStatusTransitService) . testCoordStatusTransitServiceForTimeout(org.apache.oozie.service.TestStatusTransitService) . testNone(org.apache.oozie.command.coord.TestCoordActionInputCheckXCommandNonUTC) . testActionKillCommandActionNumbers(org.apache.oozie.command.coord.TestCoordActionsKillXCommand) . testCoordMaterializeTriggerService3(org.apache.oozie.service.TestCoordMaterializeTriggerService) . testMaxMatThrottleNotPicked(org.apache.oozie.service.TestCoordMaterializeTriggerService) . testCoordStatus_Killed(org.apache.oozie.command.coord.TestCoordChangeXCommand) +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/3048/
          Hide
          pbacsko Peter Bacsko added a comment -
          Show
          pbacsko Peter Bacsko added a comment - Sure - here it is: https://reviews.apache.org/r/50202/
          Hide
          abhishekbafna Abhishek Bafna added a comment -

          Peter Bacsko Can you please upload the patch to RB.

          Show
          abhishekbafna Abhishek Bafna added a comment - Peter Bacsko Can you please upload the patch to RB.
          Hide
          hadoopqa Hadoop QA added a comment -

          Testing JIRA OOZIE-1978

          Cleaning local git workspace

          ----------------------------

          +1 PATCH_APPLIES
          +1 CLEAN
          +1 RAW_PATCH_ANALYSIS
          . +1 the patch does not introduce any @author tags
          . +1 the patch does not introduce any tabs
          . +1 the patch does not introduce any trailing spaces
          . +1 the patch does not introduce any line longer than 132
          . +1 the patch does adds/modifies 2 testcase(s)
          +1 RAT
          . +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
          . +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
          . +1 HEAD compiles
          . +1 patch compiles
          . +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
          . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
          . +1 the patch does not modify JPA files
          +1 TESTS
          . Tests run: 1787
          +1 DISTRO
          . +1 distro tarball builds with the patch

          ----------------------------
          +1 Overall result, good!, no -1s

          The full output of the test-patch run is available at

          . https://builds.apache.org/job/oozie-trunk-precommit-build/3038/

          Show
          hadoopqa Hadoop QA added a comment - Testing JIRA OOZIE-1978 Cleaning local git workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN +1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . +1 the patch does not introduce any trailing spaces . +1 the patch does not introduce any line longer than 132 . +1 the patch does adds/modifies 2 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE . +1 HEAD compiles . +1 patch compiles . +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files +1 TESTS . Tests run: 1787 +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- +1 Overall result, good!, no -1s The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/3038/
          Hide
          hadoopqa Hadoop QA added a comment -

          Testing JIRA OOZIE-1978

          Cleaning local git workspace

          ----------------------------

          +1 PATCH_APPLIES
          +1 CLEAN
          -1 RAW_PATCH_ANALYSIS
          . +1 the patch does not introduce any @author tags
          . +1 the patch does not introduce any tabs
          . -1 the patch contains 1 line(s) with trailing spaces
          . -1 the patch contains 2 line(s) longer than 132 characters
          . +1 the patch does adds/modifies 2 testcase(s)
          +1 RAT
          . +1 the patch does not seem to introduce new RAT warnings
          +1 JAVADOC
          . +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
          . +1 HEAD compiles
          . +1 patch compiles
          . +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
          . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
          . +1 the patch does not modify JPA files
          -1 TESTS
          . Tests run: 1787
          . Tests failed: 1
          . Tests errors: 0

          . The patch failed the following testcases:

          . testActionCheckTransientDuringLauncher(org.apache.oozie.command.wf.TestActionCheckXCommand)

          +1 DISTRO
          . +1 distro tarball builds with the patch

          ----------------------------
          -1 Overall result, please check the reported -1(s)

          The full output of the test-patch run is available at

          . https://builds.apache.org/job/oozie-trunk-precommit-build/3037/

          Show
          hadoopqa Hadoop QA added a comment - Testing JIRA OOZIE-1978 Cleaning local git workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN -1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . -1 the patch contains 1 line(s) with trailing spaces . -1 the patch contains 2 line(s) longer than 132 characters . +1 the patch does adds/modifies 2 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE . +1 HEAD compiles . +1 patch compiles . +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS . Tests run: 1787 . Tests failed: 1 . Tests errors: 0 . The patch failed the following testcases: . testActionCheckTransientDuringLauncher(org.apache.oozie.command.wf.TestActionCheckXCommand) +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/3037/
          Hide
          hadoopqa Hadoop QA added a comment -

          Testing JIRA OOZIE-1978

          Cleaning local git workspace

          ----------------------------

          +1 PATCH_APPLIES
          +1 CLEAN
          -1 RAW_PATCH_ANALYSIS
          . +1 the patch does not introduce any @author tags
          . +1 the patch does not introduce any tabs
          . -1 the patch contains 4 line(s) with trailing spaces
          . -1 the patch contains 2 line(s) longer than 132 characters
          . -1 the patch does not add/modify any testcase
          -1 RAT
          . -1 the patch seems to introduce 1 new RAT warning(s)
          +1 JAVADOC
          . +1 the patch does not seem to introduce new Javadoc warnings
          +1 COMPILE
          . +1 HEAD compiles
          . +1 patch compiles
          . +1 the patch does not seem to introduce new javac warnings
          +1 BACKWARDS_COMPATIBILITY
          . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
          . +1 the patch does not modify JPA files
          -1 TESTS
          . Tests run: 1787
          . Tests failed: 13
          . Tests errors: 9

          . The patch failed the following testcases:

          . testDryrunInvalidXml(org.apache.oozie.command.wf.TestSubmitXCommand)
          . testWfNoForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser)
          . testSimpleForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser)
          . testNestedForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser)
          . testTransition2(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser)
          . testTransition3(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser)
          . testErrorTransitionForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser)
          . testDecisionForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser)
          . testDecisionsToJoinForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser)
          . testDecisionsToKillForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser)
          . testDecisionTwoPathsForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser)
          . testMultipleDecisionThreePathsForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser)
          . testDecisionMultipleForks(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser)

          +1 DISTRO
          . +1 distro tarball builds with the patch

          ----------------------------
          -1 Overall result, please check the reported -1(s)

          The full output of the test-patch run is available at

          . https://builds.apache.org/job/oozie-trunk-precommit-build/3036/

          Show
          hadoopqa Hadoop QA added a comment - Testing JIRA OOZIE-1978 Cleaning local git workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN -1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . -1 the patch contains 4 line(s) with trailing spaces . -1 the patch contains 2 line(s) longer than 132 characters . -1 the patch does not add/modify any testcase -1 RAT . -1 the patch seems to introduce 1 new RAT warning(s) +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE . +1 HEAD compiles . +1 patch compiles . +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS . Tests run: 1787 . Tests failed: 13 . Tests errors: 9 . The patch failed the following testcases: . testDryrunInvalidXml(org.apache.oozie.command.wf.TestSubmitXCommand) . testWfNoForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser) . testSimpleForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser) . testNestedForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser) . testTransition2(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser) . testTransition3(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser) . testErrorTransitionForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser) . testDecisionForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser) . testDecisionsToJoinForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser) . testDecisionsToKillForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser) . testDecisionTwoPathsForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser) . testMultipleDecisionThreePathsForkJoin(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser) . testDecisionMultipleForks(org.apache.oozie.workflow.lite.TestLiteWorkflowAppParser) +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/3036/
          Hide
          pbacsko Peter Bacsko added a comment -

          I submitted my patch for review.

          Couple of things:

          • I extracted all validation stuff from LiteWorkflowAppParser to a separate class. Acyclic validation was already there but I changed it slightly (with this approach we can actually print the path that leads to the loop - I stole this from the existing validateForkJoin)
          • I think we need extra validation for transitions, eg. we need to make sure that a node (other than End and Kill) points to someting. I don't think this is checked right now. Or is it checked on schema level?
          • Add acyclic test (at least one)
          • Add test with a big fork-join WF, like the one which is attached. Probably should run on a separate thread just in case it doesn't terminate (I've seen this in Robert's patch)
          • We might think about other edge cases and tests
          Show
          pbacsko Peter Bacsko added a comment - I submitted my patch for review. Couple of things: I extracted all validation stuff from LiteWorkflowAppParser to a separate class. Acyclic validation was already there but I changed it slightly (with this approach we can actually print the path that leads to the loop - I stole this from the existing validateForkJoin) I think we need extra validation for transitions, eg. we need to make sure that a node (other than End and Kill) points to someting. I don't think this is checked right now. Or is it checked on schema level? Add acyclic test (at least one) Add test with a big fork-join WF, like the one which is attached. Probably should run on a separate thread just in case it doesn't terminate (I've seen this in Robert's patch) We might think about other edge cases and tests
          Hide
          puru Purshotam Shah added a comment -

          We did not look into the forkjoin validation though.

          We have disabled fork and join validation in our production.

          Show
          puru Purshotam Shah added a comment - We did not look into the forkjoin validation though. We have disabled fork and join validation in our production.
          Hide
          pbacsko Peter Bacsko added a comment -

          Robert Kanter thanks for the patch. After taking a quick look, it roughly has the same approach as mine - the TODOs are useful to think about, I'm not sure we really validate all edge cases. For example, I can't find a test case in TestLiteWorkflow which validates loop detection (error E707). So at least one additional test is necessary, but maybe there are some more.

          Show
          pbacsko Peter Bacsko added a comment - Robert Kanter thanks for the patch. After taking a quick look, it roughly has the same approach as mine - the TODOs are useful to think about, I'm not sure we really validate all edge cases. For example, I can't find a test case in TestLiteWorkflow which validates loop detection (error E707). So at least one additional test is necessary, but maybe there are some more.
          Hide
          rkanter Robert Kanter added a comment -

          That's a fancy ASCII diagram Peter Bacsko. I realized that there's a ton of paths before, but didn't realize there were that many (I never did the math here).

          Anyway, I have a work-in-progress patch that I had been meaning to upload for quite some time, but never got around to it. I'll attach it now. Feel free to use it or steal things from it, but if you already have something, that's fine too. It's been so long since I looked at it, and this whole thing gets complicated, so I don't remember how it works or what else I was planning to do, though there are some comments.

          Show
          rkanter Robert Kanter added a comment - That's a fancy ASCII diagram Peter Bacsko . I realized that there's a ton of paths before, but didn't realize there were that many (I never did the math here). Anyway, I have a work-in-progress patch that I had been meaning to upload for quite some time, but never got around to it. I'll attach it now. Feel free to use it or steal things from it, but if you already have something, that's fine too. It's been so long since I looked at it, and this whole thing gets complicated, so I don't remember how it works or what else I was planning to do, though there are some comments.
          Hide
          pbacsko Peter Bacsko added a comment - - edited

          HI

          I was looking at this problem and I think I found out why it is so slow sometimes. Here is a small graph:

               A1--->A2--->A3    A7--->A8--->A9 
               |           |     |            |
          S--> F1          J1-->F2            J2--->E
               |           |     |            |
               A4--->A5--->A6    A10-->A11-->A12
          

          The problem is that ALL possible execution path is covered by the validateForkJoin() method. For example, the first recursion choses the path F1->A1->A2->A3->J1 and then F2->A7->A8->A9->J2. We go back to check F2->A10->A11->A12->J2. Then the calls return to F1 where the second path is chosen: F1->A4->A5->A6. However, it again re-walks both paths for F2->J2. If you have a lot of fork-join pairs, that's a massive amount of possible paths.

          Eg. we have 10 ForkJoin pairs with 5 forks each, that's 500 paths it total. In the attached example, we have 20 pairs with 6 actions, that is 6^20 (3 656 158 440 062 976) paths. No wonder it doesn't finish in 15 hours. Also, at each invocation, we check if we found a cycle - that's the explanation for the lot of LinkedList.contains().

          I'm already working on a solution which separately verifies the acyclic property of the graph and then it validates FJ. It's not yet finished, but I did a test run on the example and it took <1ms to complete. So looks like it's working. I'll attach the patch as soon as it passes all unit tests.

          Show
          pbacsko Peter Bacsko added a comment - - edited HI I was looking at this problem and I think I found out why it is so slow sometimes. Here is a small graph: A1--->A2--->A3 A7--->A8--->A9 | | | | S--> F1 J1-->F2 J2--->E | | | | A4--->A5--->A6 A10-->A11-->A12 The problem is that ALL possible execution path is covered by the validateForkJoin() method. For example, the first recursion choses the path F1->A1->A2->A3->J1 and then F2->A7->A8->A9->J2. We go back to check F2->A10->A11->A12->J2. Then the calls return to F1 where the second path is chosen: F1->A4->A5->A6. However, it again re-walks both paths for F2->J2. If you have a lot of fork-join pairs, that's a massive amount of possible paths. Eg. we have 10 ForkJoin pairs with 5 forks each, that's 500 paths it total. In the attached example, we have 20 pairs with 6 actions, that is 6^20 (3 656 158 440 062 976) paths. No wonder it doesn't finish in 15 hours. Also, at each invocation, we check if we found a cycle - that's the explanation for the lot of LinkedList.contains() . I'm already working on a solution which separately verifies the acyclic property of the graph and then it validates FJ. It's not yet finished, but I did a test run on the example and it took <1ms to complete. So looks like it's working. I'll attach the patch as soon as it passes all unit tests.
          Hide
          rohini Rohini Palaniswamy added a comment -

          Robert Kanter,
          Take a look at Purshotam Shah's patch in OOZIE-2345 as well. It parallelizes job submission in fork as we were hitting major slowness with that with 20+ fork jobs with 15m SLA when there was some NN slowness. We did not look into the forkjoin validation though.

          Show
          rohini Rohini Palaniswamy added a comment - Robert Kanter , Take a look at Purshotam Shah 's patch in OOZIE-2345 as well. It parallelizes job submission in fork as we were hitting major slowness with that with 20+ fork jobs with 15m SLA when there was some NN slowness. We did not look into the forkjoin validation though.
          Hide
          rkanter Robert Kanter added a comment -

          FYI: I've been slowly working on this and have it roughly halfway done.

          We've also seen an issue due to this where a Coordinator gets stuck and you can't even kill it because the Kill command can't acquire the Coordinator's lock since it's being held while the forkjoin validation is taking forever.

          Show
          rkanter Robert Kanter added a comment - FYI: I've been slowly working on this and have it roughly halfway done. We've also seen an issue due to this where a Coordinator gets stuck and you can't even kill it because the Kill command can't acquire the Coordinator's lock since it's being held while the forkjoin validation is taking forever.

            People

            • Assignee:
              pbacsko Peter Bacsko
              Reporter:
              rkanter Robert Kanter
            • Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development