Samza / SAMZA-2266

Introduce a backoff when there are repeated failures for host-affinity allocations


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: None
    • Labels: None

    Description

      Allocations for dead containers are retried, and retried again on each subsequent failure, within a very short window of time (<1 min).

      NodeManagers (NMs) have been observed to take ~2 min to mark themselves as unhealthy to the ResourceManager (RM).

      If a job has host-affinity enabled, containers are therefore allocated on the same unhealthy host multiple times, which eventually kills the application.

      This ticket is to evaluate the feasibility of, and possibly implement, a fix that introduces a time backoff on retries of container allocation on the same host, so that a different host is eventually chosen once the unhealthy NM's status has been updated.
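
      A minimal sketch of what such a per-host backoff could look like. The ticket does not specify the backoff shape; exponential doubling with a cap is an assumption here, and the class name HostAllocationBackoff, its methods, and the 30 s / 4 min values in the usage note are hypothetical, not Samza APIs or settings.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of Samza): tracks consecutive container
 * failures per host and computes how long to wait before requesting the
 * same host again.
 */
final class HostAllocationBackoff {
  private final Duration baseDelay;
  private final Duration maxDelay;
  private final Map<String, Integer> failuresByHost = new HashMap<>();

  HostAllocationBackoff(Duration baseDelay, Duration maxDelay) {
    this.baseDelay = baseDelay;
    this.maxDelay = maxDelay;
  }

  /** Records one more failure on the host and returns the delay before the next retry. */
  Duration recordFailure(String host) {
    int failures = failuresByHost.merge(host, 1, Integer::sum);
    return delayFor(failures);
  }

  /** Clears the failure history once a container runs successfully on the host. */
  void recordSuccess(String host) {
    failuresByHost.remove(host);
  }

  /** Delay doubles with each consecutive failure and is capped at maxDelay. */
  Duration delayFor(int failures) {
    if (failures <= 0) {
      return Duration.ZERO;
    }
    long base = baseDelay.toMillis();
    long max = maxDelay.toMillis();
    long factor = 1L << Math.min(failures - 1, 30); // bound the shift
    // Compare before multiplying so the product cannot overflow.
    long delay = base > max / factor ? max : base * factor;
    return Duration.ofMillis(delay);
  }
}
```

      With a base delay of 30 s and a cap of 4 min, consecutive failures on one host would yield delays of 30 s, 1 min, 2 min, and 4 min, so the third retry is already spaced beyond the ~2 min an NM was observed to need before reporting itself unhealthy.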

      We may also want to look into the possibility of abandoning host-affinity on the 8th attempt of restarting a container - so we don't kill the entire job.
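
      The fallback idea above can be sketched as a small policy object. This is an illustration under assumptions, not the implemented fix: the class HostAffinityFallback, the ANY_HOST sentinel as used here, and the 1-based attempt counting are all hypothetical.

```java
/**
 * Hypothetical policy (not part of Samza): decides which host to request
 * when restarting a container, dropping host-affinity once a configured
 * restart attempt is reached.
 */
final class HostAffinityFallback {
  /** Sentinel meaning "no host preference"; the name is an assumption of this sketch. */
  static final String ANY_HOST = "ANY_HOST";

  private final int fallbackAttempt;

  /** @param fallbackAttempt 1-based restart attempt at which affinity is abandoned */
  HostAffinityFallback(int fallbackAttempt) {
    if (fallbackAttempt < 1) {
      throw new IllegalArgumentException("fallbackAttempt must be >= 1");
    }
    this.fallbackAttempt = fallbackAttempt;
  }

  /** Returns the preferred host until the fallback attempt, then ANY_HOST. */
  String hostToRequest(String preferredHost, int attemptNumber) {
    return attemptNumber >= fallbackAttempt ? ANY_HOST : preferredHost;
  }
}
```

      With new HostAffinityFallback(8), attempts 1 through 7 would still request the preferred host, and the 8th attempt would request any host instead of failing the job.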


          People

            Assignee: Daniel Nishimura (dnishimura)
            Reporter: Daniel Nishimura (dnishimura)
            Votes: 0
            Watchers: 1


              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 9h
