When FsUrlStreamHandlerFactory is registered with java.net.URL (ex: when Spark is initialized), it breaks URLs with spaces (even though they are properly URI-encoded). I traced the problem down to FSUrlConnection.connect() method. It naively gets the path from the URL, which contains encoded spaces, and pases it to org.apache.hadoop.fs.Path(String) constructor. This is not correct, because the docs clearly say that the string must NOT be encoded. Doing so causes double encoding within the Path class (ie: %20 becomes %2520).
See attached JUnit test.
This test case mimics an issue I ran into when trying to use Commons Configuration 1.9 AFTER initializing Spark. Commons Configuration uses URL class to load configuration files, but Spark installs FsUrlStreamHandlerFactory, which hits this issue. For now, we are using an AspectJ aspect to "patch" the bytecode at load time to work-around the issue.
The real fix is quite simple. All you need to do is replace this line in org.apache.hadoop.fs.FsUrlConnection.connect():
is = fs.open(new Path(url.getPath()));
with this line:
is = fs.open(new Path(url.toUri().getPath()));
URI.getPath() will correctly decode the path, which is what is expected by org.apache.hadoop.fs.Path(String) constructor.