Hadoop Common / HADOOP-1995

Path can not handle a file name that contains a back slash

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: 0.14.1
    • Fix Version/s: None
    • Component/s: fs
    • Labels:
      None

      Description

      When normalizing a path name, Path incorrectly converts a backslash to a path separator even if the path name is Unix-style. This prevents a glob from using a backslash to escape a special character. A fix is to make path normalization file system dependent.
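To make the report concrete, here is a minimal Java sketch of the behavior described above. It is not Hadoop's actual Path code: the class name, both method names, and the `windowsStyle` flag are hypothetical and exist only for illustration. The first method treats every backslash as a separator, which is the reported problem; the second shows roughly what "file system dependent" normalization could look like.

```java
final class BackslashNormalizationSketch {
    private BackslashNormalizationSketch() {}

    // Hypothetical stand-in for the reported behavior:
    // every backslash becomes a path separator, regardless of path style.
    static String normalizeAlways(String path) {
        return path.replace('\\', '/');
    }

    // Hypothetical file-system-dependent variant (an assumption, not the
    // project's design): backslashes are converted only for Windows-style
    // paths, so a Unix-style glob keeps its escape characters.
    static String normalizeFor(boolean windowsStyle, String path) {
        return windowsStyle ? path.replace('\\', '/') : path;
    }
}
```

With the first method, a Unix-style glob such as `logs/\[a\]` loses its escapes and gains spurious separators; with the second, it is left untouched, while a Windows-style name such as `C:\foo\bar` is still converted.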

          Activity

          Harsh J added a comment -

          I wonder if the branch-1-win tackled this somehow?

          Mahadev konar added a comment -

          It looks like this would be impossible without making Path OS dependent (able to differentiate between Windows and Linux), but then it would involve different application behavior on Windows and Linux. I am marking this as a later fix.

          Doug Cutting added a comment -

          > I would vote that all paths are uris and thus must use "/" as the separator on all operating systems and file systems.

          That would certainly be nice, and we try to do that as much as possible. Paths are always normalized this way. But if we start rejecting paths with backslashes, or interpreting backslashes as quotations, Hadoop on windows will start exploding all over the place, with no easy central place to fix things.

          > I would push the flip from "/" to "\" in the local file system when running on windows.

          As I mentioned above, not all paths come from a FileSystem impl, so we can't depend on this happening before we see a path. Folks also process paths in OS-independent code, traversing directories, so delaying it until the filesystem sees the path won't work either. I've tried the high road, and it seems impassable. There are also back-compatibility constraints: we don't want to break user code, and a lot of user code processes paths.

          I think cygwin is a good analogy. Cygwin tries to use unix syntax and, at the same time, support windows paths from, e.g., environment variables. For the most part it works, but there are a few edge cases where things don't work quite the same, as in the email I cited above. We need to minimize those edge cases to rare situations and have a ready workaround. But we may not be able to easily eliminate them.

          You're welcome to try the high road yourself. I've already spent more hours than I care to trying to get Hadoop paths to work transparently across Windows and Linux. The current solution is not arbitrary, but the result of lots of trial and error.

          Owen O'Malley added a comment -

          I would vote that all paths are uris and thus must use "/" as the separator on all operating systems and file systems. I would push the flip from "/" to "\" in the local file system when running on windows. I don't know what would break, but I think the gain in consistency would be worth it.

          Doug Cutting added a comment -

          > A fix is to make path normalization file system dependent.

          First, there's a technical problem: normalization is currently done in Path's constructor, when the FileSystem is unknown. But, even so, I'm not sure that will solve it.

          By this you mean that a local path that contains backslashes will have them escaped by Path's constructor, so that "\[bar,baz]" will be parsed as "/[bar,baz]", while an HDFS path like "\[bar,baz]" will be parsed as "\[bar,baz]", so that the '[' is unavailable for globbing. But then applications that run on both Unix and Windows, and use both the local fs and HDFS, will have to pass in different kinds of path strings, no?

          Not all paths come from a FileSystem implementation; some come from environment variables, config files, constant strings in user code, etc. Thus we must be able to handle Windows file names passed to the Path constructor that have not undergone special escaping, e.g., C:\foo\bar should be parsed as c:/foo/bar. We've tried other approaches and they've not worked well.

          This is a hard problem to handle well:

          http://www.cygwin.com/ml/cygwin/1999-06/msg00213.html

          Perhaps we need to expect some Path-related things to be broken on Windows, but make those be rarely used things. Windows paths that contain '[' or ']' simply might not work correctly when passed to listPaths unless the user is careful to insert escapes: we will not attempt to insert such escapes automatically. We would only translate '\' to '/' when running on Windows, and only then when it's not immediately followed by another backslash. This will mean that a directory whose name starts with a glob character will not work correctly on Windows unless the developer manually inserts appropriate escapes, but that globs will work correctly on Windows. My assumption is that directories beginning with glob characters are much rarer than uses of glob characters for globbing. Could that work?
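The translation rule proposed in this comment can be sketched roughly as follows. This is an illustration only, not code from Hadoop: the class name, the method name, and the `onWindows` parameter are hypothetical. The comment does not say how a doubled backslash should be consumed, so the sketch assumes the pair is kept verbatim as a unit; other readings of the rule are possible.

```java
final class ProposedTranslationSketch {
    private ProposedTranslationSketch() {}

    // Translate '\' to '/' only when running on Windows, and only when the
    // backslash is not immediately followed by another backslash.
    static String translate(String path, boolean onWindows) {
        if (!onWindows) {
            return path; // no translation off Windows
        }
        StringBuilder out = new StringBuilder(path.length());
        int i = 0;
        while (i < path.length()) {
            char c = path.charAt(i);
            boolean doubled = c == '\\'
                    && i + 1 < path.length()
                    && path.charAt(i + 1) == '\\';
            if (doubled) {
                // Assumption: a doubled backslash is left untouched as a pair.
                out.append("\\\\");
                i += 2;
            } else if (c == '\\') {
                out.append('/'); // lone backslash becomes a separator
                i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, `C:\foo\bar` becomes `C:/foo/bar` on Windows, a name containing a doubled backslash keeps that pair, and nothing is changed when `onWindows` is false.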


            People

            • Assignee: Mahadev konar
            • Reporter: Hairong Kuang
            • Votes: 0
            • Watchers: 2
