  1. Hadoop YARN
  2. YARN-6319

race condition between deleting app dir and deleting container dir

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels: None

    Description

      When the last container of an app (on a given node) completes, it:

      --> triggers async deletion of the container dir (container cleanup)
      --> triggers async deletion of the app dir (app cleanup)

      For LCE (LinuxContainerExecutor), deletion is done by container-executor. The "app cleanup" lists the sub-directories (step 1) and then unlinks the items one by one (step 2). If a file is deleted by the "container cleanup" between step 1 and step 2, the "app cleanup" reports the error below and aborts the deletion.

      ContainerExecutor: Couldn't delete file $LOCAL/usercache/$USER/appcache/application_1481785469354_353539/container_1481785469354_353539_01_000028/$FILE - No such file or directory
      

      The app dir then escapes the cleanup, which is why we always have many app dirs left behind.
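
      To make the failure mode concrete, here is a minimal, deterministic replay of the list-then-unlink pattern in C. It is only a sketch and not the actual container-executor code: the helper names (list_entries, unlink_listed_strict, replay_race), the temp dir template, and the file names are all made up for illustration, and the "race" is simulated by removing one listed file between step 1 and step 2.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_ENTRIES 16
#define NAME_LEN 256

/* Step 1 of "app cleanup" (hypothetical helper): snapshot the entry names.
 * Returns the number of names stored, or -1 if the dir cannot be opened. */
static int list_entries(const char *dir, char names[][NAME_LEN], int max) {
    DIR *d = opendir(dir);
    if (d == NULL) return -1;
    int n = 0;
    struct dirent *e;
    while (n < max && (e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0) continue;
        snprintf(names[n++], NAME_LEN, "%s", e->d_name);
    }
    closedir(d);
    return n;
}

/* Step 2 (hypothetical helper): unlink each listed name and abort on the
 * first error, returning its errno. This mirrors the behaviour described
 * in the report; it returns 0 only if every unlink succeeded. */
static int unlink_listed_strict(const char *dir, char names[][NAME_LEN], int n) {
    char path[512];
    for (int i = 0; i < n; i++) {
        snprintf(path, sizeof path, "%s/%s", dir, names[i]);
        if (unlink(path) != 0) return errno;
    }
    return 0;
}

/* Replays the race in a temp dir holding three files. Returns the errno
 * seen by the strict step 2 (or -1 on setup failure) and stores the number
 * of files that escaped deletion in *leftover. */
int replay_race(int *leftover) {
    assert(leftover != NULL);
    char dir[] = "/tmp/yarn6319_XXXXXX";   /* illustrative template */
    char path[512];
    char names[MAX_ENTRIES][NAME_LEN];
    const char *files[] = {"a", "b", "c"};

    if (mkdtemp(dir) == NULL) return -1;
    for (int i = 0; i < 3; i++) {
        snprintf(path, sizeof path, "%s/%s", dir, files[i]);
        FILE *f = fopen(path, "w");
        if (f == NULL) return -1;
        fclose(f);
    }

    int n = list_entries(dir, names, MAX_ENTRIES);   /* step 1 */
    if (n != 3) return -1;

    /* Simulated "container cleanup" removes a file that was already listed. */
    snprintf(path, sizeof path, "%s/%s", dir, names[0]);
    unlink(path);

    int err = unlink_listed_strict(dir, names, n);   /* step 2 aborts here */

    /* Count what was left behind, then tidy up the temp dir. */
    *leftover = list_entries(dir, names, MAX_ENTRIES);
    for (int i = 0; i < *leftover; i++) {
        snprintf(path, sizeof path, "%s/%s", dir, names[i]);
        unlink(path);
    }
    rmdir(dir);
    return err;
}
```

      In this replay, replay_race(&left) returns ENOENT and sets left to 2: the first unlink hits the already-removed file, step 2 stops, and the two remaining files are never deleted.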

      Solution 1: ignore the error in container-executor.c::delete_path() instead of aborting the deletion.
      Solution 2: use a lock to serialize the cleanups of the same app dir.
      Solution 3: back off and retry on error.
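
      As a rough illustration of solution 1, the sketch below treats ENOENT from unlink() as success, since it means some other cleanup already removed the file. This is not a patch against delete_path(): unlink_tolerant and replay_double_delete are hypothetical names used only for this example, and any other error is still reported to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: unlink `path`, treating "already gone" as success.
 * Returns 0 if the file was removed or did not exist, otherwise the errno
 * of the failed unlink(). */
int unlink_tolerant(const char *path) {
    assert(path != NULL);
    if (unlink(path) == 0) return 0;
    if (errno == ENOENT) return 0;   /* another cleanup got there first */
    return errno;                    /* real errors still surface */
}

/* Replays two cleanups deleting the same file. Returns what the second
 * (late) cleanup sees, or -1 on setup failure. */
int replay_double_delete(void) {
    char path[] = "/tmp/yarn6319_XXXXXX";   /* illustrative template */
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    close(fd);
    if (unlink_tolerant(path) != 0) return -1;   /* "container cleanup" */
    return unlink_tolerant(path);                /* "app cleanup" arrives late */
}
```

      With this helper the late cleanup sees 0 and can carry on with the remaining entries, while unrelated failures (for example ENOTDIR or EACCES) still abort as before. Whether silently skipping ENOENT is acceptable everywhere delete_path() is used is a separate question for review.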

      Comments are welcome.

          People

            zhiguohong Hong Zhiguo