# filenames with ':' colon throws java.lang.IllegalArgumentException

## Details

• Type: Bug
• Status: Resolved
• Priority: Major
• Resolution: Duplicate
• Affects Version/s: None
• Fix Version/s: None
• Component/s: None
• Labels:
None

## Description

File names containing colon ":" throws java.lang.IllegalArgumentException while LINUX file system supports it.

$hadoop dfs -put ./testfile-2007-09-24-03:00:00.gz filenametest Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: testfile-2007-09-24-03:00:00.gz at org.apache.hadoop.fs.Path.initialize(Path.java:140) at org.apache.hadoop.fs.Path.<init>(Path.java:126) at org.apache.hadoop.fs.Path.<init>(Path.java:50) at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:273) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:117) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:776) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:757) at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:116) at org.apache.hadoop.fs.FsShell.run(FsShell.java:1229) at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:187) at org.apache.hadoop.fs.FsShell.main(FsShell.java:1342) Caused by: java.net.URISyntaxException: Relative path in absolute URI: testfile-2007-09-24-03:00:00.gz at java.net.URI.checkPath(URI.java:1787) at java.net.URI.<init>(URI.java:735) at org.apache.hadoop.fs.Path.initialize(Path.java:137) ... 10 more Path(String pathString) when given a filename which contains ':' treats it as URI and selects anything before ':' as scheme, which in this case is clearly not a valid scheme. ## Attachments 1. HADOOP-2066.patch 2 kB Doug Cutting 2. 2066_20071022.patch 4 kB Tsz Wo Nicholas Sze ## Issue Links ## Activity Lohit Vijayarenu created issue - Jim Kellerman made changes - Field Original Value New Value Link This issue duplicates HADOOP-2056 Jim Kellerman made changes -  Link This issue duplicates HADOOP-2056 Hide Tsz Wo Nicholas Sze added a comment - - edited I guess / is the only invalid character for the filename in Linux (GNU/Linux). All filenames below are valid. A#B A$B  A%B  A&B  A*B  A+B  A,B  A:B  A;B  A=B  A?B  A[B  A\B  A^B  A{B

Show
Tsz Wo Nicholas Sze added a comment - - edited I guess / is the only invalid character for the filename in Linux (GNU/Linux). All filenames below are valid. A#B A$B A%B A&B A*B A+B A,B A:B A;B A=B A?B A[B A\B A^B A{B Hide Doug Cutting added a comment - The bug is not that this fails, but that it still fails even if you escape the colons, turning them into '%3a'. Path(String) does not properly handle escapes. Show Doug Cutting added a comment - The bug is not that this fails, but that it still fails even if you escape the colons, turning them into '%3a'. Path(String) does not properly handle escapes. Hide eric baldeschwieler added a comment - The API used does not look like a URI API, hence a URI quoting requirement also seems wrong. Show eric baldeschwieler added a comment - The API used does not look like a URI API, hence a URI quoting requirement also seems wrong. Hide Doug Cutting added a comment - > The API used does not look like a URI API, hence a URI quoting requirement also seems wrong. It is like a URI constructor. URI cannot be subclassed, and we want to add behavior, so we wrap our URIs in Paths. The javadoc for the Path constructor says, and has always said, "Path strings are URIs". We long-ago settled on URI syntax for file paths. So URI escapes seem appropriate. I've attached a patch which fixes this on Linux. Who knows what havoc it will wreak on Windows, though, where colons in paths are more common, and requiring applications to escape them, while correct, may prove painful. Show Doug Cutting added a comment - > The API used does not look like a URI API, hence a URI quoting requirement also seems wrong. It is like a URI constructor. URI cannot be subclassed, and we want to add behavior, so we wrap our URIs in Paths. The javadoc for the Path constructor says, and has always said, "Path strings are URIs". We long-ago settled on URI syntax for file paths. So URI escapes seem appropriate. I've attached a patch which fixes this on Linux. Who knows what havoc it will wreak on Windows, though, where colons in paths are more common, and requiring applications to escape them, while correct, may prove painful. Doug Cutting made changes -  Attachment HADOOP-2066.patch [ 12367891 ] Hide Doug Cutting added a comment - By, "fixes this", I mean this permits one to include colons in filenames by using the escape sequence "%3a". In particular, 'bin/hadoop fs -put foo%3abar baz' will copy the local file named 'foo:bar'. Show Doug Cutting added a comment - By, "fixes this", I mean this permits one to include colons in filenames by using the escape sequence "%3a". In particular, 'bin/hadoop fs -put foo%3abar baz' will copy the local file named 'foo:bar'. Hide Tsz Wo Nicholas Sze added a comment - Do you want to use java.net.URLEncoder to encode filename, so that 'bin/hadoop fs -put foo:bar baz' will work? Show Tsz Wo Nicholas Sze added a comment - Do you want to use java.net.URLEncoder to encode filename, so that 'bin/hadoop fs -put foo:bar baz' will work? Hide Doug Cutting added a comment - > Do you want to use java.net.URLEncoder to encode filename, so that 'bin/hadoop fs -put foo:bar baz' will work? That would break things like 'bin/hadoop fs -cp hdfs://a:999/foo/bar baz'. Colons have an interpretation in paths, that's why they need to be escaped. Show Doug Cutting added a comment - > Do you want to use java.net.URLEncoder to encode filename, so that 'bin/hadoop fs -put foo:bar baz' will work? That would break things like 'bin/hadoop fs -cp hdfs://a:999/foo/bar baz'. Colons have an interpretation in paths, that's why they need to be escaped. Hide Tsz Wo Nicholas Sze added a comment - We could only encode filenames and directory names but not the entire string. Then, all characters except / can be allowed. Also, we probably should provide a option to allow user giving a raw URI. Show Tsz Wo Nicholas Sze added a comment - We could only encode filenames and directory names but not the entire string. Then, all characters except / can be allowed. Also, we probably should provide a option to allow user giving a raw URI. Hide Doug Cutting added a comment - > We could only encode filenames and directory names but not the entire string. Then, all characters except / can be allowed. How do we know what parts are the filename? The colon separates the scheme from the rest of the uri, so we cannot know whether a colon in a name delimits the scheme or part of the name. > Also, we probably should provide a option to allow user giving a raw URI. We only support a subset of URIs, so we need to have our own constructors to validate and normalize. We could accept a URI and check that it is of the form we require. The easiest implementation would be to convert it to a string and pass it to the existing constructor. What's the use case? Show Doug Cutting added a comment - > We could only encode filenames and directory names but not the entire string. Then, all characters except / can be allowed. How do we know what parts are the filename? The colon separates the scheme from the rest of the uri, so we cannot know whether a colon in a name delimits the scheme or part of the name. > Also, we probably should provide a option to allow user giving a raw URI. We only support a subset of URIs, so we need to have our own constructors to validate and normalize. We could accept a URI and check that it is of the form we require. The easiest implementation would be to convert it to a string and pass it to the existing constructor. What's the use case? Lohit Vijayarenu made changes -  Description File names containing colon ":" throws java.lang.IllegalArgumentException while LINUX file system supports it. [lohit@krygw1000 ~]$ hadoop dfs -put ./testfile-2007-09-24-03:00:00.gz filenametest Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: testfile-2007-09-24-03:00:00.gz at org.apache.hadoop.fs.Path.initialize(Path.java:140) at org.apache.hadoop.fs.Path.(Path.java:126) at org.apache.hadoop.fs.Path.(Path.java:50) at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:273) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:117) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:776) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:757) at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:116) at org.apache.hadoop.fs.FsShell.run(FsShell.java:1229) at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:187) at org.apache.hadoop.fs.FsShell.main(FsShell.java:1342) Caused by: java.net.URISyntaxException: Relative path in absolute URI: testfile-2007-09-24-03:00:00.gz at java.net.URI.checkPath(URI.java:1787) at java.net.URI.(URI.java:735) at org.apache.hadoop.fs.Path.initialize(Path.java:137) ... 10 more [lohit@krygw1000 ~]$Path(String pathString) when given a filename which contains ':' treats it as URI and selects anything before ':' as scheme, which in this case is clearly not a valid scheme. File names containing colon ":" throws java.lang.IllegalArgumentException while LINUX file system supports it.$ hadoop dfs -put ./testfile-2007-09-24-03:00:00.gz filenametest Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: testfile-2007-09-24-03:00:00.gz at org.apache.hadoop.fs.Path.initialize(Path.java:140) at org.apache.hadoop.fs.Path.(Path.java:126) at org.apache.hadoop.fs.Path.(Path.java:50) at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:273) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:117) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:776) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:757) at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:116) at org.apache.hadoop.fs.FsShell.run(FsShell.java:1229) at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:187) at org.apache.hadoop.fs.FsShell.main(FsShell.java:1342) Caused by: java.net.URISyntaxException: Relative path in absolute URI: testfile-2007-09-24-03:00:00.gz at java.net.URI.checkPath(URI.java:1787) at java.net.URI.(URI.java:735) at org.apache.hadoop.fs.Path.initialize(Path.java:137) ... 10 more Path(String pathString) when given a filename which contains ':' treats it as URI and selects anything before ':' as scheme, which in this case is clearly not a valid scheme.
Hide
Tsz Wo Nicholas Sze added a comment - - edited

> How do we know what parts are the filename?

Could we assume the URIs are hierarchical URIs (defined in the java.net.URI)? If yes and we further force the rule that scheme must be specified with authority,
then the syntax will be
[[scheme:]//authority]path

I think the colons can be distinguished in this case.

I agree that the current patch is simple and clean. However, users cannot perform something like

> ls
a:b   foo
> bin/hadoop fs -put a* baz

Show
Tsz Wo Nicholas Sze added a comment - - edited > How do we know what parts are the filename? Could we assume the URIs are hierarchical URIs (defined in the java.net.URI )? If yes and we further force the rule that scheme must be specified with authority, then the syntax will be [ [scheme:] //authority]path I think the colons can be distinguished in this case. I agree that the current patch is simple and clean. However, users cannot perform something like > ls a:b foo > bin/hadoop fs -put a* baz
Hide
Lohit Vijayarenu added a comment -

How would be the usage with this patch?
I tried these

[lohit@hadoop-trunk]$hadoop dfs -put ./testfile-2007-09-24-03:00:00.gz "testfile-2007-09-24-03%3a00%3a00.gz" put: Pathname /user/lohit/testfile-2007-09-24-03:00:00.gz from testfile-2007-09-24-03:00:00.gz is not a valid DFS filename. [lohit@hadoop-trunk]$ echo "TEST" > testfile
[lohit@hadoop-trunk]$hadoop dfs -put ./testfile "testfile-2007-09-24-03%3a00%3a00.gz" put: Pathname /user/lohit/testfile-2007-09-24-03:00:00.gz from testfile-2007-09-24-03:00:00.gz is not a valid DFS filename. Show Lohit Vijayarenu added a comment - How would be the usage with this patch? I tried these [lohit@hadoop-trunk]$ hadoop dfs -put ./testfile-2007-09-24-03:00:00.gz "testfile-2007-09-24-03%3a00%3a00.gz" put: Pathname /user/lohit/testfile-2007-09-24-03:00:00.gz from testfile-2007-09-24-03:00:00.gz is not a valid DFS filename. [lohit@hadoop-trunk] $echo "TEST" > testfile [lohit@hadoop-trunk]$ hadoop dfs -put ./testfile "testfile-2007-09-24-03%3a00%3a00.gz" put: Pathname /user/lohit/testfile-2007-09-24-03:00:00.gz from testfile-2007-09-24-03:00:00.gz is not a valid DFS filename.
Hide
Doug Cutting added a comment -

Nicholas, yes, perhaps we can optimize our parsing of URIs to accept more cases, so that, in particular, colons can be included in relative paths, excluding only relative paths that contain '://'. Should we still support escapes? Would you like to make a patch for this? Passing tests on Windows will be the hard part. I've not yet tested my patch on Windows and assume it will fail. It would be easier if we don't support escapes...

Lohit, HDFS does not permit colons in file names, while Unix does. This is hard-coded into HDFS. It could perhaps be changed. It was done to err on the safe side: we did not want to end up with files whose names were illegal, and hence might, e.g., not be removed or read, and chose to check for characters which could prove problematic and prohibit them.

Show
Doug Cutting added a comment - Nicholas, yes, perhaps we can optimize our parsing of URIs to accept more cases, so that, in particular, colons can be included in relative paths, excluding only relative paths that contain '://'. Should we still support escapes? Would you like to make a patch for this? Passing tests on Windows will be the hard part. I've not yet tested my patch on Windows and assume it will fail. It would be easier if we don't support escapes... Lohit, HDFS does not permit colons in file names, while Unix does. This is hard-coded into HDFS. It could perhaps be changed. It was done to err on the safe side: we did not want to end up with files whose names were illegal, and hence might, e.g., not be removed or read, and chose to check for characters which could prove problematic and prohibit them.
Hide
Tsz Wo Nicholas Sze added a comment -

Sure, I will make a patch.

Show
Tsz Wo Nicholas Sze added a comment - Sure, I will make a patch.
Hide
Tsz Wo Nicholas Sze added a comment -
• All characters except slash are valid for directory/file names
• Syntax:
[[<scheme>:]//<authority>]<path> | file:<path>
Show
Tsz Wo Nicholas Sze added a comment - All characters except slash are valid for directory/file names Syntax: [[<scheme>:] //<authority>]<path> | file:<path>
Tsz Wo Nicholas Sze made changes -
 Attachment 2066_20071019.patch [ 12368059 ]
Hide
Lohit Vijayarenu added a comment -

Hi Nicholas,

I just tried this patch against trunk.
Am I doing this right?

[lohit@ hadoop-trunk]$hadoop dfs -put ./abcd:efgh.tar.gz "abcd\:efgh.tar.gz" put: Pathname /user/lohit/abcd\:efgh.tar.gz from abcd\:efgh.tar.gz is not a valid DFS filename. [lohit@ hadoop-trunk]$ hadoop dfs -put ./abcd:efgh.tar.gz "abcd%3aefgh.tar.gz"
[lohit@ hadoop-trunk]$hadoop dfs -ls Found 3 items /user/lohit/abcd <r 3> 5 2007-10-22 15:38 /user/lohit/abcd%3aefgh.tar.gz <r 3> 5 2007-10-22 15:39 /user/lohit/test.x <r 3> 61646 2007-10-22 15:38 [lohit@ucdev27 hadoop-trunk]$

Thanks!

Show
Lohit Vijayarenu added a comment - Hi Nicholas, I just tried this patch against trunk. Am I doing this right? [lohit@ hadoop-trunk] $hadoop dfs -put ./abcd:efgh.tar.gz "abcd\:efgh.tar.gz" put: Pathname /user/lohit/abcd\:efgh.tar.gz from abcd\:efgh.tar.gz is not a valid DFS filename. [lohit@ hadoop-trunk]$ hadoop dfs -put ./abcd:efgh.tar.gz "abcd%3aefgh.tar.gz" [lohit@ hadoop-trunk] $hadoop dfs -ls Found 3 items /user/lohit/abcd <r 3> 5 2007-10-22 15:38 /user/lohit/abcd%3aefgh.tar.gz <r 3> 5 2007-10-22 15:39 /user/lohit/test.x <r 3> 61646 2007-10-22 15:38 [lohit@ucdev27 hadoop-trunk]$ Thanks!
Hide
Tsz Wo Nicholas Sze added a comment -

Hi Lohit,

The message says, "abcd\:efgh.tar.gz is not a valid DFS filename." because HDFS does not permit special characters like colon (see FSNamesystem.isValidName). This is not a parsing problem. If you try something like

hadoop dfs -put ./a:b ./ab

i.e. put and rename the file. It should work.

Nicholas

Show
Tsz Wo Nicholas Sze added a comment - Hi Lohit, The message says, "abcd\:efgh.tar.gz is not a valid DFS filename." because HDFS does not permit special characters like colon (see FSNamesystem.isValidName). This is not a parsing problem. If you try something like hadoop dfs -put ./a:b ./ab i.e. put and rename the file. It should work. Nicholas
Hide
Doug Cutting added a comment -

Nicholas: it looks like you've only addressed including colons within "file:" URIs. Is there a reason for this? That seems rather arbitrary to me. I would much prefer a general solution (like the one I attached).

Show
Doug Cutting added a comment - Nicholas: it looks like you've only addressed including colons within "file:" URIs. Is there a reason for this? That seems rather arbitrary to me. I would much prefer a general solution (like the one I attached).
Hide
Tsz Wo Nicholas Sze added a comment -

I intend to force the rule "scheme, if there is any, must be specified with authority." However, there are quite a few places in the current codes that using path uri like "file:/foo/bar". If we don't allow "file:" without authority, we might have a compatibility issue.

Show
Tsz Wo Nicholas Sze added a comment - I intend to force the rule "scheme, if there is any, must be specified with authority." However, there are quite a few places in the current codes that using path uri like "file:/foo/bar". If we don't allow "file:" without authority, we might have a compatibility issue.
Hide
Tsz Wo Nicholas Sze added a comment -

I found that the URI constructor indeed can handle special character like ':'. The error reported by Lohit is another issue.

The first URI below will success but the second one will fail.

new URI("http", "www.apache.com", "/a:b", null, null);
new URI("http", "www.apache.com", "./a:b", null, null);


That's why the error message says "Relative path in absolute URI". I will think about how to fix it.

Show
Tsz Wo Nicholas Sze added a comment - I found that the URI constructor indeed can handle special character like ':'. The error reported by Lohit is another issue. The first URI below will success but the second one will fail. new URI( "http" , "www.apache.com" , "/a:b" , null , null ); new URI( "http" , "www.apache.com" , "./a:b" , null , null ); That's why the error message says "Relative path in absolute URI". I will think about how to fix it.
Hide
Tsz Wo Nicholas Sze added a comment -

In 2066_20071022.patch, we handle colons as following

• If the input URI is absolute, then the colons before the root slash are for scheme or authority and the colons after it are for path.
• For relative URIs, if colons occur after a slash, they must be a part of file/directory names. No problem.
• If the URI is relative and the first colon occur before any slash, then the colon will be misinterpreted for scheme in the URI constructor. Therefore, ./ is inserted in the beginning in Path.normalizePath.
• For some wired relative URI like hdfs://apache.com/foo (i.e. hdfs: and apache.com are directories), then users have to add ./ themselves (i.e. ./hdfs://apache.com/foo)

Another question: should we allow colons in HDFS?

Show
Tsz Wo Nicholas Sze added a comment - In 2066_20071022.patch, we handle colons as following If the input URI is absolute, then the colons before the root slash are for scheme or authority and the colons after it are for path. For relative URIs, if colons occur after a slash, they must be a part of file/directory names. No problem. If the URI is relative and the first colon occur before any slash, then the colon will be misinterpreted for scheme in the URI constructor. Therefore, ./ is inserted in the beginning in Path.normalizePath . For some wired relative URI like hdfs://apache.com/foo (i.e. hdfs: and apache.com are directories), then users have to add ./ themselves (i.e. ./hdfs://apache.com/foo ) Another question: should we allow colons in HDFS?
Tsz Wo Nicholas Sze made changes -
 Attachment 2066_20071022.patch [ 12368184 ]
Tsz Wo Nicholas Sze made changes -
 Attachment 2066_20071019.patch [ 12368059 ]
Hide
eric baldeschwieler added a comment -

I feel like we are going down a rat hole here. How do we get our input syntax to not surprise users? users expect to be able to specify posix like file strings. If URL syntax is going to keep throwing surprises and cross platform issues at us, maybe we need to create a syntax that doesn;t pull in a lot of unwanted baggage?

Show
eric baldeschwieler added a comment - I feel like we are going down a rat hole here. How do we get our input syntax to not surprise users? users expect to be able to specify posix like file strings. If URL syntax is going to keep throwing surprises and cross platform issues at us, maybe we need to create a syntax that doesn;t pull in a lot of unwanted baggage?
Hide
Doug Cutting added a comment -

> users expect to be able to specify posix like file strings

Which users? Not Windows or MacOS X users, since those OSes don't have POSIX file names, especially when it comes to colons. Linux users may expect posix names, but Hadoop is cross-platform.

URIs are a good standard for our pathnames, a better standard than POSIX. URIs can handle platform-dependent characters like colons just fine. We have in the past made some exceptions in URI syntax to simplify platform compatibility. In particular, we interpret "C:\foo\bar" as "file:///c:/foo/bar" without requiring escapes. Perhaps we should continue this trend further, as Nicholas has proposed, perhaps any exceptions are a mistake, or perhaps we should not make any new exceptions. That seems like a reasonable discussion. But I don't see the case to abandon URIs entirely as our primary filename syntax specification. Handling non-portable filename characters like colons with or without quotation seems a minor issue next to having an i18n-safe standard that extensibly handles protocols, hosts, ports, etc. I don't consider that unwanted baggage.

Show
Doug Cutting added a comment - > users expect to be able to specify posix like file strings Which users? Not Windows or MacOS X users, since those OSes don't have POSIX file names, especially when it comes to colons. Linux users may expect posix names, but Hadoop is cross-platform. URIs are a good standard for our pathnames, a better standard than POSIX. URIs can handle platform-dependent characters like colons just fine. We have in the past made some exceptions in URI syntax to simplify platform compatibility. In particular, we interpret "C:\foo\bar" as "file:///c:/foo/bar" without requiring escapes. Perhaps we should continue this trend further, as Nicholas has proposed, perhaps any exceptions are a mistake, or perhaps we should not make any new exceptions. That seems like a reasonable discussion. But I don't see the case to abandon URIs entirely as our primary filename syntax specification. Handling non-portable filename characters like colons with or without quotation seems a minor issue next to having an i18n-safe standard that extensibly handles protocols, hosts, ports, etc. I don't consider that unwanted baggage.
Hide
eric baldeschwieler added a comment -

I think it makes great sense to support a URI interface to the FS. I recall championing that.

I'm not at all sure it makes sense to define what is a valid filename based on a URI library.

URI can support quoting and we will undoubtedly support several non-URI interfaces to the FS.

I can create filenames with colon's on my mac btw. Not that I see why our filenames should be defined by the limits of what all supported client platforms can support.

I think this issue requires more thought.

Show
eric baldeschwieler added a comment - I think it makes great sense to support a URI interface to the FS. I recall championing that. I'm not at all sure it makes sense to define what is a valid filename based on a URI library. URI can support quoting and we will undoubtedly support several non-URI interfaces to the FS. I can create filenames with colon's on my mac btw. Not that I see why our filenames should be defined by the limits of what all supported client platforms can support. I think this issue requires more thought.
Hide
Doug Cutting added a comment -

> I'm not at all sure it makes sense to define what is a valid filename based on a URI library.

The URI standard provides a good interchange syntax for file names. But we shouldn't let it limit what names are possible in various filesystems as we do today: we should support their full range, using escapes where necessary. Unfortunately, with the current API, we can't tell when a character needs to be escaped or when it is intended as a URI meta-character.

The problem is that we construct paths in FileSystem independent code, so we don't know how to escape things. Perhaps the solution is to remove the public Path constructor and force all Paths to be created by a FileSystem#createPath method, so that they can be escaped appropriately.

Thus, when running on Windows, if one passes a string with unescaped backslashes to LocalFileSystem#createPath(), the backslashes would be interpreted as directory separators, while on Linux or HDFS they'd be treated as literals. Unescaped slashes in a Path URI will always be directory separators, since that's the URI standard we're using for interchange.

Show
Doug Cutting added a comment - > I'm not at all sure it makes sense to define what is a valid filename based on a URI library. The URI standard provides a good interchange syntax for file names. But we shouldn't let it limit what names are possible in various filesystems as we do today: we should support their full range, using escapes where necessary. Unfortunately, with the current API, we can't tell when a character needs to be escaped or when it is intended as a URI meta-character. The problem is that we construct paths in FileSystem independent code, so we don't know how to escape things. Perhaps the solution is to remove the public Path constructor and force all Paths to be created by a FileSystem#createPath method, so that they can be escaped appropriately. Thus, when running on Windows, if one passes a string with unescaped backslashes to LocalFileSystem#createPath(), the backslashes would be interpreted as directory separators, while on Linux or HDFS they'd be treated as literals. Unescaped slashes in a Path URI will always be directory separators, since that's the URI standard we're using for interchange.
Hide
eric baldeschwieler added a comment -

Good idea. Something like this would address my concerns.

Show
eric baldeschwieler added a comment - Good idea. Something like this would address my concerns.
Hide
Tsz Wo Nicholas Sze added a comment -

> Perhaps the solution is to remove the public Path constructor and force all Paths to be created by a FileSystem#createPath method, so that they can be escaped appropriately.

I also like this idea.

However, LocalFileSystem#createPath may be still hard to implement.

I also suggest that each file system should provide static methods for creating Path. So that Path can be created without initializing a FileSystem object.

Show
Tsz Wo Nicholas Sze added a comment - > Perhaps the solution is to remove the public Path constructor and force all Paths to be created by a FileSystem#createPath method, so that they can be escaped appropriately. I also like this idea. However, LocalFileSystem#createPath may be still hard to implement. I also suggest that each file system should provide static methods for creating Path. So that Path can be created without initializing a FileSystem object.
Arun C Murthy made changes -
 Link This issue is related to HADOOP-3256
Arun C Murthy made changes -
 Link This issue is part of HADOOP-3257
Tsz Wo Nicholas Sze made changes -
 Link This issue is related to HADOOP-3173
Tsz Wo Nicholas Sze made changes -
Owen O'Malley made changes -
 Project Hadoop Common [ 12310240 ] HDFS [ 12310942 ] Key HADOOP-2066 HDFS-13 Component/s dfs [ 12310710 ]
Tsz Wo Nicholas Sze made changes -
 Link This issue is related to HADOOP-6334
Harsh J made changes -
 Link This issue is duplicated by HDFS-2557
Hide
Wijnand Suijlen added a comment -

I've recently been bitten by this issue, and while finding out why this is an issue at all, I was amazed that Hadoop tries to implement its own URL class which it calls org.apache.hadoop.fs.Path, and uses this class to interface with FileSystem. If I understand it correctly, the reason for this custom URL class, is that the user can't be bothered escaping his paths. But I think the current interface leads to confusion. If FileSystem worked with just java.net.URI instead of its own Path class then it is absolutely clear on how to construct paths and when extra escaping is necessary, because Sun's/Oracle's javadocs are very clear. Of course, when working from the command line, it might still be convenient to have a convenience class like the current org.apache.hadoop.fs.Path to ease the burden of writing well formed path names.

The colon thing is not the only problem, I found, related to org.apache.hadoop.fs.FileSystem and org.apache.hadoop.fs.Path. For example 'FileSystem.checkPath' does string comparisons on the authority part. The problem there has almost been fixed by using a case insensitive string comparison, but it will still give bad results if one of the authorities is written with an IP address while the other is written with a DNS name.

Show
Wijnand Suijlen added a comment - I've recently been bitten by this issue, and while finding out why this is an issue at all, I was amazed that Hadoop tries to implement its own URL class which it calls org.apache.hadoop.fs.Path, and uses this class to interface with FileSystem. If I understand it correctly, the reason for this custom URL class, is that the user can't be bothered escaping his paths. But I think the current interface leads to confusion. If FileSystem worked with just java.net.URI instead of its own Path class then it is absolutely clear on how to construct paths and when extra escaping is necessary, because Sun's/Oracle's javadocs are very clear. Of course, when working from the command line, it might still be convenient to have a convenience class like the current org.apache.hadoop.fs.Path to ease the burden of writing well formed path names. The colon thing is not the only problem, I found, related to org.apache.hadoop.fs.FileSystem and org.apache.hadoop.fs.Path. For example 'FileSystem.checkPath' does string comparisons on the authority part. The problem there has almost been fixed by using a case insensitive string comparison, but it will still give bad results if one of the authorities is written with an IP address while the other is written with a DNS name.
Hide
Harsh J added a comment -

The latter problem you describe is seen frequently as well, when one tries to query the NameNode with a path that includes the IP instead of hostname.

Wijnand - Do you have a backwards-compatible solution in mind as well?

Also, am marking this as a duplicate as the ideal JIRA should be HADOOP-3257. Lets discuss on that one.

Show
Harsh J added a comment - The latter problem you describe is seen frequently as well, when one tries to query the NameNode with a path that includes the IP instead of hostname. Wijnand - Do you have a backwards-compatible solution in mind as well? Also, am marking this as a duplicate as the ideal JIRA should be HADOOP-3257 . Lets discuss on that one.
Harsh J made changes -
 Status Open [ 1 ] Resolved [ 5 ] Resolution Duplicate [ 3 ]
Hide
Mayurkumar Patel added a comment -

Not sure weather this is issue only with colons, because I was trying this command "hadoop jar hadoop-streaming.jar -file "/user/gpadmin/TestData/WordCountApp/WorldCountMapper.exe, /user/gpadmin/TestData/WordCountApp/ConsoleApplication1.exe" -input "/user/gpadmin/TestData/WordCountInput/action.txt" -output "/user/gpadmin/TestData/WordCountOutput" -mapper "WorldCountMapper.exe" -reducer "ConsoleApplication1.exe"" .. Path doesn't contain any colon still getting the same error message.

Show
Mayurkumar Patel added a comment - Not sure weather this is issue only with colons, because I was trying this command "hadoop jar hadoop-streaming.jar -file "/user/gpadmin/TestData/WordCountApp/WorldCountMapper.exe, /user/gpadmin/TestData/WordCountApp/ConsoleApplication1.exe" -input "/user/gpadmin/TestData/WordCountInput/action.txt" -output "/user/gpadmin/TestData/WordCountOutput" -mapper "WorldCountMapper.exe" -reducer "ConsoleApplication1.exe"" .. Path doesn't contain any colon still getting the same error message.
Transition Time In Source Status Execution Times Last Executer Last Execution Date
 Open Resolved
1538d 5h 48m 1 Harsh J 02/Jan/12 03:40

## People

• Assignee:
Unassigned
Reporter:
Lohit Vijayarenu