Apache Apex Core / APEXCORE-743

Killed container is shown as running


Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 3.7.0
    • None
    • None

    Description

      Observed behavior:

      1. A container heartbeat timeout occurs.
      2. The AppMaster sends a request to kill the container.
      3. The container is killed.
      4. The AppMaster state is not updated, and no new container is allocated.

      After analyzing the code, this is the likely cause:
      1. The AppMaster sends the kill request to the NodeManager (NM).
      2. The NM kills the container, but the NM callback never fires. RecoverContainer is called only from the NM callback, so in this case it is never called.
      3. As a result, the AppMaster state is not updated.

      Possible fix:
      Add a timeout for the NM callback, so that if the NM does not confirm in time that the container was killed, RecoverContainer is called anyway.
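
      The proposed timeout could be sketched roughly as follows. This is a minimal illustration and not the actual Apex change: the class KillRequestTracker and its methods are hypothetical names. It assumes the AppMaster has a periodic loop that can poll for overdue kill requests, and that the caller invokes the recovery logic (RecoverContainer in the description above) itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not part of Apex): remembers when each kill request
 * was sent so that a missing NM callback can be detected by timeout.
 */
public class KillRequestTracker {
    private final long timeoutMillis;
    // containerId -> time (ms) at which the kill request was sent
    private final Map<String, Long> pendingKills = new HashMap<>();

    public KillRequestTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Call when the kill request is sent to the NM. */
    public synchronized void killRequested(String containerId, long nowMillis) {
        pendingKills.putIfAbsent(containerId, nowMillis);
    }

    /**
     * Call from the NM callback. Returns true if the kill was still pending,
     * meaning the caller should run recovery now; returns false if the
     * timeout path already handled this container (or it was never tracked).
     */
    public synchronized boolean killConfirmed(String containerId) {
        return pendingKills.remove(containerId) != null;
    }

    /**
     * Call periodically. Returns, and stops tracking, every container whose
     * NM callback is overdue; the caller should run recovery for each one.
     */
    public synchronized List<String> expire(long nowMillis) {
        List<String> overdue = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pendingKills.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= timeoutMillis) {
                overdue.add(entry.getKey());
                it.remove();
            }
        }
        return overdue;
    }
}
```

      Because an entry is removed by whichever path sees it first, recovery runs once per container even if the NM callback arrives late, after the timeout has already fired.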

            People

              sandesh Sandesh Hegde
              sandesh Sandesh Hegde