@Todd - Yes, to re-trigger you need to restart the TT. This is how the code currently works: once a directory is removed from LocalStorage's "good list" it is never put back while the TT is running, i.e. once a dir is identified as bad the TT won't use it again. LocalDirAllocator#confChanged tries to notice when a new dir is added to the conf, but we don't add new MR local dirs at runtime, so this feature isn't used.
Per HADOOP-7551, LocalDirAllocator (common) and LocalStorage (mr) are currently independent but should be aware of each other.
@Ravi - LocalDirAllocator already keeps track of the valid dirs itself. Once there is a bad dir, LocalDirAllocator#confChanged executes on every call to get a local directory, and it's this code that calls checkDirs on each local directory. It turns out the version of checkDirs that doesn't take a permissions parameter is not as expensive as I thought (the version that takes a permission forks a call to ls for each directory, which is expensive). However, confChanged also creates a new DF object for each local dir, which has the side effect of resetting the df interval. That means a call to df is forked each time LocalDirAllocator uses a DF, instead of the last result being served from the cache.
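To illustrate the df interval point, here is a minimal sketch of why re-creating the object defeats the cache. CachedDf and DfCacheDemo are hypothetical stand-ins, not Hadoop's DF or LocalDirAllocator, and the "fork" is only simulated by a counter; the 30 second interval is likewise an assumption for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a df wrapper that caches its last result. */
class CachedDf {
    private final long intervalMs;        // how long a cached result stays fresh
    private final AtomicInteger forks;    // counts simulated "df" invocations
    private boolean hasValue = false;
    private long lastRefreshMs;
    private long availableBytes;

    CachedDf(long intervalMs, AtomicInteger forks) {
        this.intervalMs = intervalMs;
        this.forks = forks;
    }

    long getAvailable(long nowMs) {
        // Refresh only on first use or when the cached value has expired.
        if (!hasValue || nowMs - lastRefreshMs >= intervalMs) {
            forks.incrementAndGet();      // stands in for forking a df process
            availableBytes = 1000L;       // placeholder value
            lastRefreshMs = nowMs;
            hasValue = true;
        }
        return availableBytes;
    }
}

class DfCacheDemo {
    /** One object reused across calls: only the first call "forks". */
    static int forksWhenReused(int calls) {
        AtomicInteger forks = new AtomicInteger();
        CachedDf df = new CachedDf(30_000, forks);
        for (int i = 0; i < calls; i++) {
            df.getAvailable(i);
        }
        return forks.get();
    }

    /** A new object before each call (as when the dir list is rebuilt): every call "forks". */
    static int forksWhenRecreated(int calls) {
        AtomicInteger forks = new AtomicInteger();
        for (int i = 0; i < calls; i++) {
            new CachedDf(30_000, forks).getAvailable(i);
        }
        return forks.get();
    }
}
```

With 100 calls inside one interval, the reused object refreshes once while the re-created one refreshes 100 times, which is the cost described above.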
In short, I think getting a local directory is expensive whenever the configured dirs differ from the list of valid dirs maintained by LocalDirAllocator. If we remove bad dirs from the conf in the TT, then the two won't differ. Alternatively, we could modify LocalDirAllocator to ignore bad directories, but that would conflict with its current design, which explicitly tries to notice a difference between the set of valid dirs and the set of configured dirs.
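A rough sketch of the first option, removing bad dirs from the configured value so that it matches the valid list. LocalDirsConfSketch and withoutBadDirs are hypothetical names for illustration only; this assumes the local dirs are stored as a comma-separated string and does not touch any real Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class LocalDirsConfSketch {
    /**
     * Returns the comma-separated dir list with the known-bad dirs removed.
     * Order of the remaining dirs is preserved; empty entries are dropped.
     */
    static String withoutBadDirs(String configuredDirs, Set<String> badDirs) {
        List<String> good = new ArrayList<>();
        for (String dir : configuredDirs.split(",")) {
            String trimmed = dir.trim();
            if (!trimmed.isEmpty() && !badDirs.contains(trimmed)) {
                good.add(trimmed);
            }
        }
        return String.join(",", good);
    }
}
```

The TT would write the returned string back into its conf after a dir is marked bad, so the configured list and the allocator's valid list stay the same.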