Thank you, Xuan and Jian.
Just to provide a bit more background on this, Xuan found that streaming jobs using files in Azure Storage could not override fs.azure.block.size from the command line. It looks like the root cause he tracked down is that validateFiles checks for the existence of files against a FileSystem instance, but that instance is obtained before the -D options are handled. That leaves an instance sitting in the FileSystem cache that was created without the -D options set in its Configuration. Later, during MapReduce job split calculation, the cached instance is reused, so the fs.azure.block.size override never takes effect.
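To illustrate the mechanism, here is a minimal sketch (not code from the patch); it assumes hadoop-azure is on the classpath and uses a placeholder wasb URI:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StaleFsCacheSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder path; any wasb:// path shows the same behavior.
    Path input = new Path("wasb://container@account.blob.core.windows.net/input");

    // Analogous to validateFiles: a FileSystem is obtained (and cached)
    // before the -D overrides have been copied into the Configuration.
    FileSystem early = input.getFileSystem(conf);

    // Later, the -D override is applied to the Configuration...
    conf.setLong("fs.azure.block.size", 256L * 1024 * 1024);

    // ...but subsequent lookups (e.g. during split calculation) return the
    // cached instance, which was initialized without the override.
    FileSystem later = input.getFileSystem(conf);
    System.out.println(early == later);  // true: the override is never seen
  }
}
{code}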
I agree with the change here, because the expectation is that command-line arguments take precedence. However, I don't think we should move the -D handling all the way to the top of the method. Right now, -D options take precedence over -fs and -jt; the current patch would reverse that. I don't know if anyone depends on that behavior, but we can avoid changing it by doing the -D handling between the -conf handling and the -libjars handling. I'd be +1 for the patch with that change, provided you test it and overriding fs.azure.block.size still works.
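For clarity, here is a rough sketch of the ordering I have in mind. This is illustrative only, not the actual GenericOptionsParser code; the method and parameter names below are hypothetical:

{code:java}
import java.util.List;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GenericOptionOrderSketch {
  // Hypothetical helper: processes generic options in the suggested order.
  static void applyGenericOptions(Configuration conf,
                                  String fsUri,                // -fs
                                  List<String> confResources,  // -conf
                                  Map<String, String> dProps,  // -D
                                  List<String> files)          // -files/-libjars/-archives
      throws Exception {
    // 1. -fs (and -jt in the real parser): handled first, as today.
    if (fsUri != null) {
      conf.set("fs.defaultFS", fsUri);
    }
    // 2. -conf: layer in user-supplied configuration resources.
    for (String resource : confResources) {
      conf.addResource(new Path(resource));
    }
    // 3. -D: moved here, between -conf and -libjars, so the overrides still
    //    win over -fs/-jt/-conf but are visible to the validation below.
    for (Map.Entry<String, String> e : dProps.entrySet()) {
      conf.set(e.getKey(), e.getValue());
    }
    // 4. -files/-libjars/-archives: any FileSystem created (and cached)
    //    while validating these paths now carries the -D overrides.
    for (String file : files) {
      Path p = new Path(file);
      FileSystem fs = p.getFileSystem(conf);
      if (!fs.exists(p)) {
        throw new java.io.FileNotFoundException(file);
      }
    }
  }
}
{code}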
Should the Path.getFileSystem(Configuration conf) API guarantee that the returned file system object always applies the up-to-date conf?
This is a long-standing weakness of the FileSystem cache. It has been discussed in other JIRAs, but I can't find them now. The FileSystem cache key is composed of scheme, authority, and UserGroupInformation. However, the FileSystem#get API is phrased in terms of a whole Configuration. Various other configuration properties can tune the behavior of a FileSystem, but if you get a cached instance, those properties might not be applied. OTOH, it would be too costly to make the whole Configuration part of the cache key.
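A minimal sketch of that behavior, using the local file system so it runs without any extra setup (the property values are arbitrary):

{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsCacheKeySketch {
  public static void main(String[] args) throws Exception {
    Configuration confA = new Configuration();
    confA.setInt("io.file.buffer.size", 4096);

    Configuration confB = new Configuration();
    confB.setInt("io.file.buffer.size", 1048576);

    // The cache key is (scheme, authority, UGI), so both calls return the
    // same instance even though the Configurations differ.
    FileSystem a = FileSystem.get(URI.create("file:///"), confA);
    FileSystem b = FileSystem.get(URI.create("file:///"), confB);
    System.out.println(a == b);  // true: confB's tuning is never applied

    // Per-scheme escape hatch, at the cost of losing instance reuse:
    confB.setBoolean("fs.file.impl.disable.cache", true);
    FileSystem c = FileSystem.get(URI.create("file:///"), confB);
    System.out.println(a == c);  // false: a fresh, uncached instance
  }
}
{code}

Setting fs.<scheme>.impl.disable.cache is the existing per-scheme workaround, but it trades away the sharing the cache is meant to provide.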
This is an existing problem, unrelated to the current patch.