During endurance testing, we found a race condition that causes an empty localDirs to be passed to container-executor.
The problem is that DirectoryCollection.checkDirs() clears three collections:
This happens in a critical section guarded by a write lock. When we start a container, we retrieve the local dirs by calling dirsHandler.getLocalDirs(), which in turn invokes DirectoryCollection.getGoodDirs(). The implementation of this method is:
So we're also in a critical section guarded by the lock. But Collections.unmodifiableList() only returns a view of the collection, not a copy. After we get the view and leave the critical section, MonitoringTimerTask.run() might be scheduled to run and immediately clear localDirs, which also empties the view we are still holding.
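The view-versus-copy behaviour described above can be reproduced in isolation. The following is only a minimal, single-threaded sketch and not the Hadoop source: the class ViewRaceSketch, the method clearDirs() (standing in for the clearing step of checkDirs()) and the /data/... paths are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for DirectoryCollection; NOT the actual Hadoop code.
class ViewRaceSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> localDirs = new ArrayList<>();

  ViewRaceSketch(List<String> dirs) {
    localDirs.addAll(dirs);
  }

  // Returns a read-only VIEW: the lock protects only the call itself,
  // not later reads through the returned list.
  List<String> getGoodDirs() {
    lock.readLock().lock();
    try {
      return Collections.unmodifiableList(localDirs);
    } finally {
      lock.readLock().unlock();
    }
  }

  // Assumed stand-in for the point in checkDirs() where the list is cleared.
  void clearDirs() {
    lock.writeLock().lock();
    try {
      localDirs.clear();
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Simulates the interleaving: get the view first, then the monitor clears.
  static int sizeSeenThroughViewAfterClear() {
    ViewRaceSketch dirs = new ViewRaceSketch(Arrays.asList("/data/1", "/data/2"));
    List<String> view = dirs.getGoodDirs(); // container launch grabs the dirs
    dirs.clearDirs();                       // monitoring task runs in between
    return view.size();                     // the view reflects the cleared list
  }

  public static void main(String[] args) {
    System.out.println("dirs visible through view: " + sizeSeenThroughViewAfterClear());
  }
}
```

In this sketch the caller ends up with an empty list even though both methods take the lock correctly, because the returned object still reads from the shared backing list.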
As a result, container-executor received an empty localDirs and exited with error code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
Therefore we can't just return a view; we must return a copy with ImmutableList.copyOf().
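A rough sketch of the copy-based approach follows. To keep it free of external dependencies it uses the JDK's Collections.unmodifiableList(new ArrayList<>(...)) instead of Guava's ImmutableList.copyOf(); the idea of taking a snapshot while holding the lock is the same. As before, CopyFixSketch, clearDirs() and the /data/... paths are hypothetical and do not come from the Hadoop code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for DirectoryCollection; NOT the actual Hadoop code.
class CopyFixSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> localDirs = new ArrayList<>();

  CopyFixSketch(List<String> dirs) {
    localDirs.addAll(dirs);
  }

  // Returns a read-only COPY taken while the read lock is held, so later
  // changes to localDirs cannot affect what the caller sees.
  List<String> getGoodDirs() {
    lock.readLock().lock();
    try {
      return Collections.unmodifiableList(new ArrayList<>(localDirs));
    } finally {
      lock.readLock().unlock();
    }
  }

  // Assumed stand-in for the point in checkDirs() where the list is cleared.
  void clearDirs() {
    lock.writeLock().lock();
    try {
      localDirs.clear();
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Same interleaving as in the failing case: snapshot first, then clear.
  static int sizeSeenThroughCopyAfterClear() {
    CopyFixSketch dirs = new CopyFixSketch(Arrays.asList("/data/1", "/data/2"));
    List<String> snapshot = dirs.getGoodDirs(); // container launch grabs the dirs
    dirs.clearDirs();                           // monitoring task runs in between
    return snapshot.size();                     // the snapshot is unaffected
  }

  public static void main(String[] args) {
    System.out.println("dirs visible through copy: " + sizeSeenThroughCopyAfterClear());
  }
}
```

The cost is one extra list allocation per call, which should be negligible for a handful of directory paths.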
Credits to Szilard Nemeth for analyzing and determining the root cause.