FALCON-740

Entity kill job calls OozieClient.kill on bundle coord job ids before calling kill on bundle job id

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.6
    • Component/s: webapp
    • Labels: None

    Description

      When a Falcon user makes an entity kill API call, Falcon executes the following in org.apache.falcon.workflow.engine.OozieWorkflowEngine.killBundle(String clusterName, BundleJob job):

          // kill all coords
          for (CoordinatorJob coord : job.getCoordinators()) {
              client.kill(coord.getId());
              LOG.debug("Killed coord {} on cluster {}", coord.getId(), clusterName);
          }

          // set end time of bundle
          client.change(job.getId(), OozieClient.CHANGE_VALUE_ENDTIME + "=" + SchemaHelper.formatDateUTC(new Date()));
          LOG.debug("Changed end time of bundle {} on cluster {}", job.getId(), clusterName);

          // kill bundle
          client.kill(job.getId());
          LOG.debug("Killed bundle {} on cluster {}", job.getId(), clusterName);
      

      Two questions:
      1. Why should we kill the coordinator jobs before killing the bundle job? OozieClient.kill(bundle_job_id) should kill all of the bundle's coord jobs.
      2. Why is the end time changed for the bundle job? https://oozie.apache.org/docs/4.0.1/DG_CommandLineTool.html#Changing_pausetime_of_a_Bundle_Job does not say that the end time can be changed for a bundle job.

      I think this code should be updated. Please comment if you think I made any wrong assumptions.

      Thank you

      Attachments

        1. FALCON-740.patch
          3 kB
          Sowmya Ramesh
        2. FALCON-740.v1.patch
          3 kB
          Sowmya Ramesh

        Activity

          svenkat Venkatesh Seetharam added a comment -

          sriksun, shwethags, shaik.idris, can you please take a look?

          svenkat Venkatesh Seetharam added a comment -

          bvellanki, please provide a patch so others can review. Also, you indicated that the Oozie behavior has changed; can you reference that JIRA here?

          sriksun Srikanth Sundarrajan added a comment -

          If I recall correctly, the end time is set to identify that the bundle was killed through user action and not because of some error in the bundle definition. The coords are killed directly because Oozie was queueing the bundle kill, with no guarantee on when the underlying coords would actually get killed, hence the coord kill was inlined.
          shwethags Shwetha GS added a comment -

          Why should we kill the coordinator jobs before killing the bundle job? OozieClient.kill(bundle_job_id) should kill all the bundle's coord jobs.

          Yes, bundle kill kills the coords, but it does so asynchronously. A coord kill can therefore fail later without us finding out. To ensure that each coord is indeed killed, we kill the coords directly.

          Why is the endtime changed for bundle job? https://oozie.apache.org/docs/4.0.1/DG_CommandLineTool.html#Changing_pausetime_of_a_Bundle_Job does not say that endtime can be changed for bundlejob.

          Bundle end time can be changed, even though it may not be documented. We set the end time to differentiate between a manual kill and a status change in Oozie. But now we check the staging path to find the bundles for a given entity, so we can remove the call that sets the end time.


          svenkat Venkatesh Seetharam added a comment -

          The context is that we are facing an issue with deleting an entity. There is a regression in Oozie: if a coord is killed and you then kill the bundle, Oozie throws an exception saying that the coord is already killed. We have opened a blocker ticket against Oozie but wanted to understand the reasoning behind this code. Thanks guys!
          shwethags Shwetha GS added a comment -

          Recently they changed bundle update to be synchronous (but it returns success even if some coord updates failed). Bundle kill is still asynchronous, though. What's the Oozie bug number?

          bvellanki Balu Vellanki added a comment -

          shwethags - The bug Venkatesh referred to is an internal bug. When a Falcon user tries to delete an entity, Falcon catches and rethrows the following exception thrown by Oozie. This causes the entity delete to fail.

          2014-09-10 07:50:06,260 ERROR BundleJobChangeXCommand:540 - USER- GROUP- TOKEN[] APP- JOB0000743-140910031253668-oozie-oozi-B ACTION- XException, 
          org.apache.oozie.command.CommandException: E1320: Bundle Job change error, [[ 0000744-140910031253668-oozie-oozi-C : Coord is in killed state ]]
          at org.apache.oozie.command.bundle.BundleJobChangeXCommand.execute(BundleJobChangeXCommand.java:208)
          at org.apache.oozie.command.bundle.BundleJobChangeXCommand.execute(BundleJobChangeXCommand.java:50)
          at org.apache.oozie.command.XCommand.call(XCommand.java:281)
          at org.apache.oozie.BundleEngine.change(BundleEngine.java:85)
          at org.apache.oozie.servlet.V1JobServlet.changeBundleJob(V1JobServlet.java:585)
          

          Bowen from the Oozie team confirmed that this is caused by Falcon killing the coord jobs of a bundle, then trying to change the bundle job's end time, and then killing the bundle job. It fails because Oozie changed how it handles the bundle change command. The related Oozie JIRA is https://issues.apache.org/jira/browse/OOZIE-1807

          Since you confirmed that we can now remove the code block that sets the end time, I will do that, create a patch, and test it before submitting.

          Thanks


          svenkat Venkatesh Seetharam added a comment -

          shwethags, you are trolling in these Oozie tickets and letting it rot, what's going on?

          svenkat Venkatesh Seetharam added a comment -

          shwethags, can you please elaborate on your comment:

          Bundle end time can be changed, even though it may not be documented. To differentiate between manual kill and status change in oozie, we set end time. But now, we check the staging path to find the bundles for a given entity. So, we can remove set end time
          shwethags Shwetha GS added a comment -

          I didn't look at the exact code changes made for bundle change, so I didn't realise they would affect us.

          We set the bundle name to some prefix + process name. Since we don't maintain a process-to-bundle mapping, we need to figure out the applicable bundles for a given process. If a process is deleted and then scheduled again, we should be able to filter out the previously scheduled bundles. Oozie sets a bundle to killed either when the user kills the bundle or when its coords are killed (if all its instances are killed). To differentiate between the two, we used to set the bundle end time during process delete. But now we have changed the logic to get all bundles and select only the ones whose app path exists. Since we delete the staging path at process delete, we always get only the bundles applicable to the current entity.
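The selection described above can be sketched roughly as follows. This is a minimal illustration, not Falcon's actual implementation: `BundleRef`, `BundleFilter`, and the injected `pathExists` predicate are hypothetical names, and in Falcon the existence check would go against the cluster filesystem where the staging path lives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for the bundle metadata read back from Oozie.
final class BundleRef {
    final String id;
    final String appName;
    final String appPath;

    BundleRef(String id, String appName, String appPath) {
        this.id = id;
        this.appName = appName;
        this.appPath = appPath;
    }
}

final class BundleFilter {
    private BundleFilter() { }

    // Keep only bundles that carry the expected name and whose app path still
    // exists. Bundles left over from an earlier schedule of the same entity
    // drop out, because their staging path was deleted with the entity.
    static List<BundleRef> applicableBundles(List<BundleRef> all,
                                             String expectedName,
                                             Predicate<String> pathExists) {
        List<BundleRef> result = new ArrayList<>();
        for (BundleRef bundle : all) {
            if (bundle.appName.equals(expectedName) && pathExists.test(bundle.appPath)) {
                result.add(bundle);
            }
        }
        return result;
    }
}
```

With this approach no marker such as a changed end time is needed on the killed bundle; the deleted staging path alone separates old bundles from current ones.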

          sowmyaramesh Sowmya Ramesh added a comment - edited

          Issue is reproducible with Oozie 4.1.

          Oozie behavior changed in 4.1. In 4.0, Oozie did not allow rerunning a killed coord job. From 4.1 onwards, it does. After this change, if a user tries to update a killed coord (for example, by setting its end time), Oozie throws the exception "coord cannot be changed since it's in killed state" to indicate that the update did not go through for the coord job.

          Current code flow in Falcon for killBundle:
          1> Kill all coords in the bundle
          2> Set end time of the bundle
          3> Kill the bundle

          If it's changed to
          1> Set end time of the bundle (this action is synchronous now, after OOZIE-1807)
          2> Kill all coords in the bundle
          3> Kill the bundle

          the issue will be fixed. I have uploaded a patch with the fix.
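The difference between the two orderings can be illustrated with a toy model. This is only a sketch under stated assumptions: `FakeOozie` and `KillOrder` are hypothetical classes written for this illustration, and `FakeOozie` merely imitates the behavior reported in this thread (a bundle change fails once any of its coords is killed). It is not Oozie code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy in-memory model of the behaviour reported in this thread; not Oozie code.
final class FakeOozie {
    private final String bundleId;
    private final List<String> coordIds;
    private final Set<String> killed = new HashSet<>();
    private boolean endTimeChanged;

    FakeOozie(String bundleId, List<String> coordIds) {
        this.bundleId = bundleId;
        this.coordIds = new ArrayList<>(coordIds);
    }

    String bundleId() { return bundleId; }
    List<String> coordIds() { return coordIds; }
    boolean isKilled(String jobId) { return killed.contains(jobId); }
    boolean endTimeChanged() { return endTimeChanged; }

    void kill(String jobId) { killed.add(jobId); }

    // Assumption taken from the thread: the change is rejected if a coord is killed.
    void changeBundleEndTime() {
        for (String coordId : coordIds) {
            if (killed.contains(coordId)) {
                throw new IllegalStateException(coordId + " : Coord is in killed state");
            }
        }
        endTimeChanged = true;
    }
}

final class KillOrder {
    private KillOrder() { }

    // Current flow: kill coords, set end time, kill bundle.
    static void killCoordsFirst(FakeOozie client) {
        for (String coordId : client.coordIds()) {
            client.kill(coordId);
        }
        client.changeBundleEndTime(); // fails in the model: coords are already killed
        client.kill(client.bundleId());
    }

    // Proposed flow: set end time, kill coords, kill bundle.
    static void changeEndTimeFirst(FakeOozie client) {
        client.changeBundleEndTime();
        for (String coordId : client.coordIds()) {
            client.kill(coordId);
        }
        client.kill(client.bundleId());
    }
}
```

In this model the current flow aborts at the end-time change and never reaches the bundle kill, which matches the failed entity delete described earlier, while the reordered flow runs to completion.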

          shwethags Shwetha GS added a comment -

          We can get rid of setting the end time. It's not required anymore. Won't that work?

          sowmyaramesh Sowmya Ramesh added a comment -

          shwethags: That would work too. As long as we are sure end time is not used anywhere else we can get rid of it. I will upload the patch with that fix. Thanks!
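A sketch of what the simplified flow might look like once the end-time change is dropped. The attached patch is not reproduced here, so this is only an assumption about its shape: `JobClient` is a hypothetical minimal interface standing in for the `OozieClient.kill` call used in the description, and `SimplifiedKill` is a name made up for this illustration.

```java
import java.util.List;

// Hypothetical minimal client interface; the real code calls OozieClient.kill.
interface JobClient {
    void kill(String jobId);
}

final class SimplifiedKill {
    private SimplifiedKill() { }

    static void killBundle(JobClient client, String bundleId, List<String> coordIds) {
        // Coords are still killed directly, because the thread notes that
        // bundle kill is asynchronous and a coord kill could fail unnoticed.
        for (String coordId : coordIds) {
            client.kill(coordId);
        }
        // No end-time change here: nothing is modified on a bundle whose
        // coords are already killed, so the reported exception cannot occur.
        client.kill(bundleId);
    }
}
```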


          svenkat Venkatesh Seetharam added a comment -

          +1. Patch looks good. shwethags, do you have concerns? Sowmya has tested this end to end.
          shwethags Shwetha GS added a comment -

          Looks good. Thanks


          svenkat Venkatesh Seetharam added a comment -

          Thanks sowmyaramesh for your contribution.

          People

            Assignee: sowmyaramesh Sowmya Ramesh
            Reporter: bvellanki Balu Vellanki
            Votes: 0
            Watchers: 5
