Oozie
OOZIE-1025

Killing an Oozie job kills only the Oozie launcher job in Hadoop.

    Details

    • Type: Bug
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment:
      CentOS 5.8, Hadoop 2.0.0-cdh4.0.1

    Description

      As per release build version 3.1.3-cdh4.0.1, killing an Oozie job using the kill command kills only the Oozie launcher job in Hadoop; all other Hadoop jobs associated with that workflow keep running until they complete.
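
      For reference, such a kill can be issued with the Oozie CLI (oozie job -kill <job-id>) or the Java client API. A minimal sketch using the Java client; the server URL and workflow job id below are placeholders:

        import org.apache.oozie.client.OozieClient;
        import org.apache.oozie.client.OozieClientException;

        public class KillWorkflowJob {
            public static void main(String[] args) throws OozieClientException {
                // Placeholder Oozie server URL and workflow job id.
                OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
                // Kills the workflow; per this issue, only the launcher job dies in Hadoop.
                oozie.kill("0000001-130101000000000-oozie-oozi-W");
            }
        }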

    Issue Links

    Activity

          Alejandro Abdelnur added a comment -

          Priya, if you are reporting an issue with CDH Oozie, the correct JIRA to file the issue is https://issues.cloudera.org/

          For this one I've filed https://issues.cloudera.org/browse/DISTRO-433 there with the info you've provided.

          Cloudera will then verify the issue and, if it occurs in Apache Oozie as well, follow up in this JIRA.

          Robert Kanter added a comment -

          This is a current limitation of the clients for Pig, Sqoop, etc.
          For example, if you run a job from Pig and press Ctrl+C to kill it, it won't actually kill the Hadoop job already running on the cluster. There isn't much Oozie can do about this.

          Rohini Palaniswamy added a comment -

          Pig does kill jobs on shutdown. And from a cursory look at Hadoop's JvmManager, it seems to attempt a kill -15 before doing a kill -9.

          http://svn.apache.org/viewvc/pig/branches/branch-0.10/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java?revision=1301794&view=markup

          // Pig registers a JVM shutdown hook (HangingJobKiller) that kills any
          // MapReduce jobs the script still has running when the JVM exits.
          Runtime.getRuntime().addShutdownHook(new HangingJobKiller());

          Robert Kanter added a comment -

          I looked at that with Cheolsoo Park and you're right: when you run a job from Pig, it will try to kill it (I was wrong before). However, the shutdown hook doesn't seem to get triggered when Oozie kills the Pig launcher job. According to the Javadoc for shutdown hooks, it is possible for a hook not to run when the JVM is terminated externally (e.g. by a SIGKILL). I'm not sure how the JobTracker kills its jobs, but because the shutdown hook isn't getting triggered, I'd guess it's doing a SIGKILL or something similar.
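
          A quick way to observe this behavior outside Hadoop is to register a shutdown hook in a plain JVM and compare kill -15 with kill -9. A minimal sketch (not Oozie code; the message is illustrative):

            public class ShutdownHookDemo {
                public static void main(String[] args) throws InterruptedException {
                    // Runs on normal exit and on SIGTERM (kill -15), but NOT on
                    // SIGKILL (kill -9) -- so a forcefully killed launcher never
                    // gets a chance to clean up its child jobs.
                    Runtime.getRuntime().addShutdownHook(new Thread() {
                        @Override
                        public void run() {
                            System.out.println("shutdown hook ran; would kill child jobs here");
                        }
                    });
                    Thread.sleep(60000);  // keep the JVM alive; try kill -15 vs kill -9 on its pid
                }
            }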

          I was talking with Alejandro Abdelnur about this, and he had another idea: once we kill the Pig launcher job, Oozie could go through and kill any still-running jobs it launched, by scooping up the launched job IDs from the launcher (it prints them out once the Pig job(s) have finished). I looked into this a bit, and unfortunately it will require some extra work, because if you kill the Pig launcher job, it doesn't get a chance to write out the launched job IDs (I'd guess for the same reason the shutdown hook isn't getting triggered: the JT is violently killing it). We could have the launcher job run another thread that keeps checking for the job IDs that Pig launched and writes them out to a file as it sees new ones; then Oozie would be able to pick up the IDs from that file even if the launcher job is killed (see the sketch below). I'm pretty sure a similar solution would work for other actions that have this issue.
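
          Roughly what such a recorder thread could look like. Everything here is a placeholder sketch: the class name, poll interval, and especially currentJobIds() (how the launcher would actually discover Pig's job IDs) are not real Oozie or Pig APIs.

            import java.io.FileWriter;
            import java.io.IOException;
            import java.util.HashSet;
            import java.util.Set;

            public class ChildIdRecorder extends Thread {
                private final String idFile;
                private final Set<String> seen = new HashSet<String>();

                public ChildIdRecorder(String idFile) {
                    this.idFile = idFile;
                    setDaemon(true);  // must not keep the launcher JVM alive
                }

                // Placeholder for "ask Pig what it has launched so far".
                private Set<String> currentJobIds() {
                    return new HashSet<String>();
                }

                @Override
                public void run() {
                    while (!isInterrupted()) {
                        try {
                            for (String id : currentJobIds()) {
                                if (seen.add(id)) {
                                    // Append each new ID immediately, so the file is
                                    // (mostly) up to date even if we are killed later.
                                    FileWriter out = new FileWriter(idFile, true);
                                    out.write(id + "\n");
                                    out.close();
                                }
                            }
                            Thread.sleep(5000);  // poll interval
                        } catch (InterruptedException e) {
                            return;
                        } catch (IOException e) {
                            // best effort; try again on the next poll
                        }
                    }
                }
            }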

          Mona Chitnis added a comment -

          You can now leverage OOZIE-1160, which copies over the file storing the external job IDs launched by a Pig launcher, even in the event of a failure or kill.

          Robert Kanter added a comment -

          I looked into this a bit more recently.

          The way Oozie kills a job is to tell Hadoop to kill the launcher job. The launcher job doesn't write the child IDs until after they're finished, right before the launcher itself finishes (e.g. Pig gets run, then the launcher writes the IDs of all of the jobs launched by Pig to a file). However, when the launcher job gets killed by Hadoop, it doesn't write the file, which means that Oozie doesn't have the child IDs and so can't kill them.

          So in order to get this to work, I think we'd have to do some non-trivial refactoring of how the launcher jobs work. Some ideas I had were:

          1. Make the launcher job multithreaded, so that a second thread can pick up the jobs as soon as their IDs are available, write them to the file, and keep updating it (along the lines of the recorder thread sketched earlier). This way, when the launcher is killed, Oozie will have the child IDs (or at least most of them). This may not be possible for all action types.
          2. Have the launcher job listen on a port, accept a REST call, or something similar; instead of asking Hadoop to kill the launcher job, Oozie would send it a command over that channel, so the launcher could take care of killing the job more "nicely", including any children, and then exit itself (a toy sketch follows this comment). This would require a lot of changes, make things really complicated, and probably also open up some security concerns.

          I don't really see a clean solution, or one that we can easily apply to all action types.
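
          For illustration only, a toy version of idea 2's listener. Authentication (the security concern above) and a way to advertise the host/port back to the Oozie server, both of which any real version would need, are omitted:

            import java.io.IOException;
            import java.net.ServerSocket;
            import java.net.Socket;

            public class KillListener extends Thread {
                private final ServerSocket server;

                public KillListener(int port) throws IOException {
                    server = new ServerSocket(port);
                    setDaemon(true);
                }

                @Override
                public void run() {
                    try {
                        // Treat any incoming connection as "shut down nicely".
                        Socket client = server.accept();
                        client.close();
                        // Kill child jobs here, then exit normally so shutdown hooks run.
                        System.exit(0);
                    } catch (IOException e) {
                        // launcher is shutting down anyway
                    }
                }
            }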

          Robert Kanter added a comment -

          We should be able to use the YARN tag support added by OOZIE-1722, at least on Hadoop 2.4.0+.
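
          With tags, cleanup no longer depends on the launcher surviving long enough to write a file. On Hadoop 2.4.0+ the lookup could work along these lines, assuming the launcher tagged each child application with a known tag such as the Oozie action id (an illustrative sketch, not Oozie's actual implementation):

            import java.util.Collections;
            import java.util.EnumSet;

            import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
            import org.apache.hadoop.yarn.api.protocolrecords.GetApplicationsRequest;
            import org.apache.hadoop.yarn.api.protocolrecords.KillApplicationRequest;
            import org.apache.hadoop.yarn.api.records.ApplicationReport;
            import org.apache.hadoop.yarn.api.records.YarnApplicationState;
            import org.apache.hadoop.yarn.client.ClientRMProxy;
            import org.apache.hadoop.yarn.conf.YarnConfiguration;

            public class KillTaggedChildren {
                public static void main(String[] args) throws Exception {
                    String tag = args[0];  // the tag the launcher put on its children
                    ApplicationClientProtocol rm = ClientRMProxy.createRMProxy(
                            new YarnConfiguration(), ApplicationClientProtocol.class);

                    // Ask the ResourceManager for applications carrying the tag
                    // that are still alive.
                    GetApplicationsRequest req = GetApplicationsRequest.newInstance();
                    req.setApplicationTags(Collections.singleton(tag));
                    req.setApplicationStates(EnumSet.of(
                            YarnApplicationState.ACCEPTED, YarnApplicationState.RUNNING));

                    for (ApplicationReport app : rm.getApplications(req).getApplicationList()) {
                        rm.forceKillApplication(
                                KillApplicationRequest.newInstance(app.getApplicationId()));
                    }
                }
            }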


    People

    • Assignee: Unassigned
    • Reporter: PriyaSundararajan
    • Votes: 0
    • Watchers: 8
