Details
Description
During endurance testing, we found a race condition that causes an empty localDirs list to be passed to container-executor.
The problem is that DirectoryCollection.checkDirs() clears three collections:
this.writeLock.lock();
try {
  localDirs.clear();
  errorDirs.clear();
  fullDirs.clear();
  ...
This happens in a critical section guarded by a write lock. When we start a container, we retrieve the local dirs by calling dirsHandler.getLocalDirs(), which in turn invokes DirectoryCollection.getGoodDirs(). The implementation of this method is:
List<String> getGoodDirs() {
  this.readLock.lock();
  try {
    return Collections.unmodifiableList(localDirs);
  } finally {
    this.readLock.unlock();
  }
}
So the read is also in a critical section, guarded by the read lock. But Collections.unmodifiableList() only returns a view of the collection, not a copy. After we get the view and release the read lock, MonitoringTimerTask.run() might be scheduled to run and immediately clear localDirs, so the list we are holding becomes empty as well.
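The view semantics can be reproduced outside of YARN with plain JDK collections. The following is a minimal sketch, not NodeManager code: the class name ViewVsCopyDemo and the directory paths are made up for illustration, and the clear() call merely stands in for what checkDirs() does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ViewVsCopyDemo {

    // Returns {size of the view, size of the copy} after the backing list is cleared.
    public static int[] sizesAfterClear() {
        List<String> localDirs = new ArrayList<>();
        localDirs.add("/hypothetical/disk1/nm-local-dir"); // illustrative path
        localDirs.add("/hypothetical/disk2/nm-local-dir"); // illustrative path

        // A read-only view: it still delegates to localDirs.
        List<String> view = Collections.unmodifiableList(localDirs);
        // A read-only snapshot: it has its own backing list.
        List<String> copy = Collections.unmodifiableList(new ArrayList<>(localDirs));

        // Stand-in for the clear() calls in checkDirs().
        localDirs.clear();

        return new int[] { view.size(), copy.size() };
    }

    public static void main(String[] args) {
        int[] sizes = sizesAfterClear();
        // prints "view size = 0, copy size = 2"
        System.out.println("view size = " + sizes[0] + ", copy size = " + sizes[1]);
    }
}
```

The view reports zero elements once the backing list is cleared, while the snapshot keeps both entries. Holding the read lock while creating the view does not help, because the view is used after the lock has been released.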
As a result, container-executor was started with an empty list of local dirs and exited with error code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
Therefore we can't just return a view; we must return a copy with ImmutableList.copyOf().
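A minimal sketch of what a copying getter could look like, kept JDK-only so it is self-contained. This is not the actual patch: GoodDirsHolder is a hypothetical stand-in for DirectoryCollection, recheck() is a hypothetical simplification of checkDirs() (the refill step after clear() is assumed, since the description elides it), and a wrapped new ArrayList<>(...) is swapped in for Guava's ImmutableList.copyOf().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for DirectoryCollection, reduced to one list.
public class GoodDirsHolder {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> localDirs = new ArrayList<>();

    public GoodDirsHolder(List<String> initialDirs) {
        localDirs.addAll(initialDirs);
    }

    // Takes a snapshot while the read lock is held, so later writers
    // cannot change what the caller sees.
    public List<String> getGoodDirs() {
        lock.readLock().lock();
        try {
            return Collections.unmodifiableList(new ArrayList<>(localDirs));
        } finally {
            lock.readLock().unlock();
        }
    }

    // Simplified stand-in for checkDirs(): clear, then refill (assumed).
    public void recheck(List<String> stillGood) {
        lock.writeLock().lock();
        try {
            localDirs.clear();
            localDirs.addAll(stillGood);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With this shape, a list obtained from getGoodDirs() before a recheck() keeps its contents even if the recheck leaves localDirs empty. The cost is one list copy per call, which should be negligible for a handful of directories.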
Credits to Szilard Nemeth for analyzing and determining the root cause.
Attachments
Issue Links
- is duplicated by: YARN-8786 LinuxContainerExecutor fails sporadically in create_local_dirs (Resolved)
- is related to: YARN-8786 LinuxContainerExecutor fails sporadically in create_local_dirs (Resolved)
- relates to: YARN-10562 Follow up changes for YARN-9833 (Resolved)