Here's a copy-n-paste from an offline discussion:
The problem here is that Hadoop is designed around URIs, and we are trying to support something (Windows paths) that is NOT a URI. Earlier, I bent over backwards trying to temporarily handle Windows paths, but Suresh convinced me it was the wrong direction.
The article I linked is about how to properly reference Windows paths as URIs. It says Windows-style \ paths are deprecated in IE, which I think essentially means the file browser. The Windows shell supports / paths, so I'm grappling with why we should perpetuate deprecated Windows paths as pseudo-URIs when real URIs appear to be fully supported on Windows.
I'd be a bit happier (or less unhappy!) if \ support were more context-specific, limited to Windows local path names. As it stands, all URIs on Windows are subject to \ to / conversion, which prevents Windows from accessing valid filenames in hdfs and other supported filesystems. I can understand/sympathize with the motivation to support c:\path, but I don't agree that hdfs:\\host\path or hdfs:\/path/path2\path3 should be supported at all. This bizarre behavior creates compatibility issues: jobs accessing paths in that way are not cross-platform compatible, i.e. they "work" on Hadoop for Windows but fail on every other OS. Once we "let the cat out of the bag" by adding more pseudo-support for non-URIs on Windows, it's going to be that much harder to take it away.
What if we did something a bit more selective:
- [a-z]:\
  - considered a Windows non-URI
  - implicitly deemed to have a "file" scheme if not already declared
  - all \ are converted to / - which means no quoting of metachars available, or we support ^ as the escape
  - throw an exception if / already exists in the path
- [a-z]:/
  - considered a standard URI
  - implicitly deemed to have a "file" scheme if not already declared
  - no \ conversion - quoting of metachars is supported
- all other URI schemes and relative paths
- add ctor Path(File)
  - allow users to create Paths from non-URIs
  - will eventually be the only supported way to access non-URI paths
- eliminate treating ":" as an invalid path character to allow drive letters
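To make the first two rules of the proposal concrete, here is a rough sketch of the classification in plain Java. It is only an illustration, not Hadoop's Path code: the class name WindowsPathSketch, the method normalize, the "file:///" prefix form, and accepting upper-case drive letters are all my assumptions. It only covers the case where no scheme has been declared, and it does not attempt the ^ escape idea.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop: sketches the proposed rules.
final class WindowsPathSketch {
    // Drive letter followed by a backslash, e.g. c:\path (a Windows non-URI).
    private static final Pattern DRIVE_BACKSLASH = Pattern.compile("[a-zA-Z]:\\\\");
    // Drive letter followed by a forward slash, e.g. c:/path (treated as a standard URI path).
    private static final Pattern DRIVE_SLASH = Pattern.compile("[a-zA-Z]:/");

    private WindowsPathSketch() {}

    static String normalize(String path) {
        if (DRIVE_BACKSLASH.matcher(path).lookingAt()) {
            // Rule 1: Windows non-URI. Reject mixed separators, then convert every \ to /.
            if (path.indexOf('/') >= 0) {
                throw new IllegalArgumentException(
                    "Mixed \\ and / in Windows path: " + path);
            }
            // Assumed rendering of the implicit "file" scheme.
            return "file:///" + path.replace('\\', '/');
        }
        if (DRIVE_SLASH.matcher(path).lookingAt()) {
            // Rule 2: standard URI path. Implicit "file" scheme, no \ conversion.
            return "file:///" + path;
        }
        // Rule 3: every other scheme and relative path is left untouched.
        return path;
    }
}
```

Under these assumptions, normalize("c:\\foo\\bar") yields file:///c:/foo/bar, normalize("c:\\foo/bar") throws, and a name such as hdfs://host/dir/a\b passes through unchanged, so a backslash in an hdfs filename stays reachable from Windows.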
I'm curious what serious breakage we'll have if we either just require standard URIs (i.e. change little to nothing) or implement the above proposal.
This includes the following test suites:
org.apache.hadoop.fs.TestFsShellCopy
org.apache.hadoop.fs.TestFsShellReturnCode