  1. Hadoop YARN
  2. YARN-2175

Container localization has no timeouts and tasks can be stuck there for a long time


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels: None

    Description

      There are no timeouts that can be used to limit the time taken by various container startup operations. Localization, for example, could take a long time, and there is no automated way to kill a task if it is stuck in one of these states. Such delays may have nothing to do with the task itself and could be an issue within the platform.

      Ideally, there should be configurable time limits for the various states within the NodeManager. The RM does not care about most of these; they matter only between the AM and the NM. We can start by making these global configurable defaults, and in the future we can make it fancier by letting the AM override them in the start container request.
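
      As a rough illustration of the proposal above, the following Java sketch bounds a localization step with a timeout and lets a per-request override take precedence over a global default. Everything named here (LocalizationTimeoutSketch, DEFAULT_LOCALIZATION_TIMEOUT_MS, effectiveTimeoutMs, runWithTimeout) is hypothetical and is not part of the YARN API or configuration; the real NodeManager localization path is more involved, so this only shows the general timeout pattern under those assumptions.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LocalizationTimeoutSketch {
    // Hypothetical global default (10 minutes); not an actual YARN setting.
    static final long DEFAULT_LOCALIZATION_TIMEOUT_MS = 600_000L;

    enum Outcome { LOCALIZED, TIMED_OUT, FAILED }

    // Use the AM-supplied override when it is present and positive,
    // otherwise fall back to the global default.
    static long effectiveTimeoutMs(long globalDefaultMs, Optional<Long> amOverrideMs) {
        return amOverrideMs.filter(ms -> ms > 0).orElse(globalDefaultMs);
    }

    // Run the (assumed) localization work on a separate thread and give up
    // once the timeout elapses, interrupting the worker.
    static Outcome runWithTimeout(Callable<Void> localization, long timeoutMs)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Void> future = executor.submit(localization);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return Outcome.LOCALIZED;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck work
            return Outcome.TIMED_OUT;
        } catch (ExecutionException e) {
            return Outcome.FAILED; // the work itself threw an exception
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate an AM override of 200 ms and a localization that hangs.
        long timeoutMs = effectiveTimeoutMs(DEFAULT_LOCALIZATION_TIMEOUT_MS, Optional.of(200L));
        Outcome outcome = runWithTimeout(() -> {
            Thread.sleep(5_000L);
            return null;
        }, timeoutMs);
        System.out.println("Localization outcome: " + outcome);
    }
}
```

      In this sketch the caller decides what to do with a TIMED_OUT outcome, for example failing the container so the AM can retry it elsewhere; that policy is also an assumption and not something this issue specifies.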

      This JIRA will be used to limit localization time; we can open others if we find that other operations need limits as well.

          People

            Assignee: Unassigned
            Reporter: Anubhav Dhoot (adhoot)
            Votes: 0
            Watchers: 15
