Hadoop YARN / YARN-9833

Race condition when DirectoryCollection.checkDirs() runs during container launch


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.3.0, 3.2.2, 3.1.4, 2.10.2
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      During endurance testing, we found a race condition that causes an empty localDirs list to be passed to container-executor.

      The problem is that DirectoryCollection.checkDirs() clears three collections:

          this.writeLock.lock();
          try {
            localDirs.clear();
            errorDirs.clear();
            fullDirs.clear();
            ...
      

      This happens in a critical section guarded by a write lock. When we start a container, we retrieve the local dirs by calling dirsHandler.getLocalDirs(), which in turn invokes DirectoryCollection.getGoodDirs(). The implementation of this method is:

      List<String> getGoodDirs() {
        this.readLock.lock();
        try {
          return Collections.unmodifiableList(localDirs);
        } finally {
          this.readLock.unlock();
        }
      }
      

      So we're also in a critical section, this time guarded by the read lock. But Collections.unmodifiableList() only returns a view of the collection, not a copy, and the caller keeps using that view after the read lock has been released. After we get the view, MonitoringTimerTask.run() might be scheduled to run and immediately clear localDirs, which empties the view as well.
      As a result, container-executor received an empty directory list and exited with error code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
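
      The view-versus-copy difference can be reproduced without any threads. The following is a minimal sketch, not Hadoop code: the class name ViewVsCopyDemo and the directory paths are made up for illustration, and the clear() call merely stands in for what checkDirs() does on the monitoring thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; not part of Hadoop.
class ViewVsCopyDemo {

  // Returns {size seen through the view, size seen through the copy}
  // after the backing list has been cleared.
  static int[] sizesAfterClear() {
    List<String> localDirs = new ArrayList<>();
    localDirs.add("/data/1/yarn/local");   // made-up paths
    localDirs.add("/data/2/yarn/local");

    // A read-only view: it shares storage with localDirs.
    List<String> view = Collections.unmodifiableList(localDirs);
    // An independent snapshot taken at the same moment.
    List<String> copy = new ArrayList<>(localDirs);

    // Stands in for checkDirs() clearing the list on the monitoring thread.
    localDirs.clear();

    return new int[] {view.size(), copy.size()};
  }

  public static void main(String[] args) {
    int[] sizes = sizesAfterClear();
    // prints "view size = 0, copy size = 2"
    System.out.println("view size = " + sizes[0] + ", copy size = " + sizes[1]);
  }
}
```

      The view reports zero entries as soon as the backing list is cleared, which is the state the container launch path observed; the copy still holds both entries.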

      Therefore we can't just return a view; we must return a copy created with ImmutableList.copyOf().
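
      A rough sketch of the copy-under-lock idea follows. It is not the actual patch: DirectoryCollectionSketch is a hypothetical, heavily simplified stand-in for DirectoryCollection, the checkDirs(List) signature is invented for the demo, and the copy is made with JDK collections instead of Guava's ImmutableList.copyOf() only to keep the example dependency-free.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, heavily simplified stand-in for DirectoryCollection.
class DirectoryCollectionSketch {
  private final List<String> localDirs = new ArrayList<>();
  private final Lock readLock;
  private final Lock writeLock;

  DirectoryCollectionSketch(List<String> initialDirs) {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    this.readLock = lock.readLock();
    this.writeLock = lock.writeLock();
    localDirs.addAll(initialDirs);
  }

  // Copy while holding the read lock, so the caller gets a snapshot
  // that later calls to checkDirs() cannot change.
  List<String> getGoodDirs() {
    readLock.lock();
    try {
      return Collections.unmodifiableList(new ArrayList<>(localDirs));
    } finally {
      readLock.unlock();
    }
  }

  // Mimics the start of checkDirs(): clear, then repopulate,
  // all under the write lock. The parameter is an assumption for the demo.
  void checkDirs(List<String> healthyDirs) {
    writeLock.lock();
    try {
      localDirs.clear();
      localDirs.addAll(healthyDirs);
    } finally {
      writeLock.unlock();
    }
  }
}
```

      With this shape, a list obtained from getGoodDirs() before a checkDirs() run keeps its contents, while a call made afterwards reflects the refreshed state.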

      Credits to Szilard Nemeth for analyzing and determining the root cause.

          People

            Assignee: Peter Bacsko (pbacsko)
            Reporter: Peter Bacsko (pbacsko)
            Votes: 0
            Watchers: 9
