Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-8903

when NM becomes unhealthy due to local disk usage, have option to kill application using most space instead of releasing all containers on node

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.1.1
    • None
    • nodemanager
    • None

    Description

      We sometimes experience an issue in which a single application, usually a Spark job, causes at least one node in a YARN cluster to become unhealthy by filling up the local dir(s) on that node past the threshold at which the node is considered unhealthy.

      When this happens, the impact is potentially large depending on what else is running on that node, as all containers on that node are lost. Sometimes not much else is running on the node and it's fine, but other times we lose AM containers from other apps and/or non-AM containers with long-running tasks.

      I thought that it would be helpful to add an option (default false) whereby if a node is going to become unhealthy due to full local disk(s), it instead identifies the application that's using the most local disk space on that node, and kills that application. (Roughly analogous to how the OOM killer in Linux picks one process to kill rather than letting the machine crash.)

      The benefit is that only one application is impacted, and no other application loses any containers. This prevents one user's poorly written code that shuffles/spills huge amounts of data from negatively impacting other users.

      The downside is that we're killing the entire application, not just the task(s) responsible for the local disk usage. I believe it's necessary to kill the whole application instead of identifying the container running the relevant task(s), because doing so would require more knowledge of the internal state of aux services responsible for shuffling than what YARN has according to my understanding.

      If this seems reasonable, I can work on the implementation.

      Attachments

        Activity

          People

            Unassigned Unassigned
            Steven Rand Steven Rand
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: