When the FileSystem cache is enabled, FileSystem.get(..) calls FileSystem.Cache.get(..), which is a synchronized method. If the lookup fails, a new instance is initialized. Depending on the FileSystem subclass implementation, the initialization may take a long time. In that case, the FileSystem.Cache lock is held for the whole initialization, and all calls to FileSystem.get(..) by other threads are blocked until it completes.
In particular, the DistributedFileSystem initialization may take a long time since there are retries. It is even worse if the socket timeout is set to a large value.
There are two possible fixes for the problem:
- (by Sanjay) Change FileSystem.Cache.get(..) so that if the lookup fails, it first releases the lock, initializes a FileSystem instance, acquires the lock again, and then adds the instance to the cache. One problem is that if a user application keeps calling FileSystem.get(..) for the same FileSystem within a short period of time, many instances will be initialized.
- Change DistributedFileSystem so that it connects lazily: it defers connecting to the server until the first RPC. A drawback is that this only fixes DistributedFileSystem and not other FileSystem subclasses.
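
The first fix could look roughly like the sketch below. This is not the actual FileSystem.Cache code: UnlockedInitCache, its initializer function, and the generic key/value types are hypothetical stand-ins for the cache, the FileSystem initialization, and the cache key/FileSystem instance. It only illustrates the release-lock, initialize, re-acquire-lock pattern, and it assumes that when two threads race, the instance that got into the cache first wins and the other one is discarded.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for FileSystem.Cache; not Hadoop's actual code. */
class UnlockedInitCache<K, V> {
    private final Map<K, V> map = new HashMap<>();
    // Stand-in for the possibly slow FileSystem initialization (assumption).
    private final Function<K, V> initializer;

    UnlockedInitCache(Function<K, V> initializer) {
        this.initializer = initializer;
    }

    V get(K key) {
        // Step 1: look up under the lock, which is held only briefly.
        synchronized (this) {
            V cached = map.get(key);
            if (cached != null) {
                return cached;
            }
        }

        // Step 2: initialize with the lock released, so a slow
        // initialization does not block get(..) calls from other threads.
        V created = initializer.apply(key);

        // Step 3: re-acquire the lock and add the instance to the cache,
        // unless another thread added one for the same key in the meantime.
        synchronized (this) {
            V existing = map.get(key);
            if (existing != null) {
                // Another thread won the race. The instance created here is
                // dropped; a real implementation would probably need to
                // close it to release its resources.
                return existing;
            }
            map.put(key, created);
            return created;
        }
    }
}
```

The sketch also shows the drawback mentioned above: every thread that misses in step 1 runs the initializer in step 2, so repeated concurrent calls for the same key can initialize many instances even though only one is kept.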