Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-3084

race when KILL_CONTAINER is received for a LOCALIZED container

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Duplicate
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: mrv2
    • Labels:
      None

      Description

      Depending on when ContainersLaunch starts a container, KILL_CONTAINER when container state is LOCALIZED (LAUNCH_CONTAINER event already sent) can end up generating a CONTAINER_LAUNCHED event - which isn't handled by ContainerState: KILLING. Also, the launched container won't be killed since CLEANUP_CONTAINER would have already been processed.

      1. MR-3084.wip.patch
        23 kB
        Hitesh Shah
      2. MR-3084.1.patch
        25 kB
        Hitesh Shah

        Issue Links

          Activity

          Hide
          Vinod Kumar Vavilapalli added a comment -

          Good catch, Sid!

          While we are at it, we need to fix a couple of missing transitions from the LOCALIZATION_FAILED state.

          Show
          Vinod Kumar Vavilapalli added a comment - Good catch, Sid! While we are at it, we need to fix a couple of missing transitions from the LOCALIZATION_FAILED state.
          Hide
          Hitesh Shah added a comment -

          Attaching more or less a working version that fixes the issue.

          Handling the launched event at the killing state is effectively a no-op as the container cleanup event is always handled after a container launch event.

          The patch effectively ensures that either the container does not come up if it has not yet or kills it if it has.

          This requires changes in hadoop-common to get around the async nature of the launches .

          Sid/Vinod, please take a look and let me know if you see something wrong/missing.

          Given the slightly complex nature of this change, I decided not to incorporate the other missing state transitions into this patch but will instead open a separate jira for those.

          Show
          Hitesh Shah added a comment - Attaching more or less a working version that fixes the issue. Handling the launched event at the killing state is effectively a no-op as the container cleanup event is always handled after a container launch event. The patch effectively ensures that either the container does not come up if it has not yet or kills it if it has. This requires changes in hadoop-common to get around the async nature of the launches . Sid/Vinod, please take a look and let me know if you see something wrong/missing. Given the slightly complex nature of this change, I decided not to incorporate the other missing state transitions into this patch but will instead open a separate jira for those.
          Hide
          Hitesh Shah added a comment -

          Forgot to add - still need to add tests for the container state transitions.

          Show
          Hitesh Shah added a comment - Forgot to add - still need to add tests for the container state transitions.
          Hide
          Hitesh Shah added a comment -

          Has the combined diffs from the changes in common as well as the changes for mapreduce.

          Added a unit test for the new state transition.

          Show
          Hitesh Shah added a comment - Has the combined diffs from the changes in common as well as the changes for mapreduce. Added a unit test for the new state transition.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          I discussed with Sid regarding this and found that there is bigger race here. ContainerLaunch does a bunch of setup steps needed for the container and if a kill arrives during any of it, the container will never be killed. We do cancel the Future, but that is without any interrupts. Also we try to kill the process, but the PID may not be available at that point of time.

          This part of the code needs a minor redesign.

          Show
          Vinod Kumar Vavilapalli added a comment - I discussed with Sid regarding this and found that there is bigger race here. ContainerLaunch does a bunch of setup steps needed for the container and if a kill arrives during any of it, the container will never be killed. We do cancel the Future, but that is without any interrupts. Also we try to kill the process, but the PID may not be available at that point of time. This part of the code needs a minor redesign.
          Hide
          Siddharth Seth added a comment -

          There's two races - One while ContainersLaunch is localizing files prior to launch, and the other between generation of the CONTAINER_LAUNCHED event and the actual launch. From an initial look at the patch, seems like both cases will be handled with the changes (will continue looking).
          One case that could be problematic is if there's multiple threads in the AsyncDispatcher - 1 per event type - that could still lead to a race between CE.cancelOrSendSignal and CE.launchContainer (similar stuff likely exists elsewhere also).
          Vinod had an alternate suggestion - by using additional states and moving the START_CONTAINER localization to an alternate state.
          Would it be simpler fixing this by just using interrupts ? (Shell does currently ignore interrupts in some cases though)

          Show
          Siddharth Seth added a comment - There's two races - One while ContainersLaunch is localizing files prior to launch, and the other between generation of the CONTAINER_LAUNCHED event and the actual launch. From an initial look at the patch, seems like both cases will be handled with the changes (will continue looking). One case that could be problematic is if there's multiple threads in the AsyncDispatcher - 1 per event type - that could still lead to a race between CE.cancelOrSendSignal and CE.launchContainer (similar stuff likely exists elsewhere also). Vinod had an alternate suggestion - by using additional states and moving the START_CONTAINER localization to an alternate state. Would it be simpler fixing this by just using interrupts ? (Shell does currently ignore interrupts in some cases though)
          Hide
          Hitesh Shah added a comment -

          Being addressed as part of MR-3240.

          Show
          Hitesh Shah added a comment - Being addressed as part of MR-3240.

            People

            • Assignee:
              Hitesh Shah
              Reporter:
              Siddharth Seth
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development