Hadoop Common / HADOOP-129

FileSystem should not name files with java.io.File

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.1.0, 0.1.1
    • Fix Version/s: 0.2.0
    • Component/s: fs
    • Labels:
      None

      Description

      In Hadoop's FileSystem API, files are currently named using java.io.File. This is confusing, as many methods on that class are inappropriate to call on Hadoop paths. For example, calling isDirectory(), exists(), etc. on a java.io.File is not the same as calling FileSystem.isDirectory() or FileSystem.exists() passing that same file. Using java.io.File also makes correct operation on Windows difficult, since java.io.File operates differently on Windows in order to accommodate Windows path names. For example, new File("/foo") is not absolute on Windows, and prints its path as "\foo", which causes confusion.
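
To illustrate the mismatch described above, here is a minimal sketch that uses only java.io.File. The class name and the file name are made up for illustration. Whatever name is passed in, exists() consults the local operating system rather than a Hadoop FileSystem, and the values of getPath() and isAbsolute() depend on the platform the JVM runs on.

```java
import java.io.File;

final class LocalFileSemantics {
    // Reports what java.io.File says about a name. Every answer comes from
    // the local platform, never from a Hadoop FileSystem.
    static String describe(String name) {
        File f = new File(name);
        return "path=" + f.getPath()
            + " absolute=" + f.isAbsolute()
            + " exists=" + f.exists();
    }

    public static void main(String[] args) {
        // Hypothetical DFS file name, used only for illustration.
        // The printed path and the "absolute" flag differ between
        // Unix-like systems and Windows.
        System.out.println(describe("/user/data/part-00000"));
    }
}
```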

      To fix this we could replace the uses of java.io.File in the FileSystem API with String, a new FileName class, or perhaps java.net.URI. The advantage of URI is that it can also naturally include the namenode host and port. The disadvantage is that URI does not support tree operations like getParent().
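
As a rough illustration of the URI option, the sketch below shows how java.net.URI exposes a host and port alongside the path. The host name, port number and the use of an "hdfs" scheme are assumptions chosen for the example, not values defined by this issue.

```java
import java.net.URI;

final class UriNameDemo {
    // Summarizes the components of a URI-style file name.
    static String parts(String name) {
        URI u = URI.create(name);
        return u.getScheme() + " | " + u.getHost() + " | " + u.getPort() + " | " + u.getPath();
    }

    public static void main(String[] args) {
        // Hypothetical namenode host and port.
        System.out.println(parts("hdfs://namenode.example.com:9000/user/data"));
        // A bare path carries no scheme or host, and its port is reported as -1.
        System.out.println(parts("/user/data"));
    }
}
```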

      This change will cause a lot of incompatibility. Thus it should probably be made early in a development cycle in order to maximize the time for folks to adapt to it.

      1. path.patch
        161 kB
        Doug Cutting

        Activity

        Doug Cutting added a comment -

        I just committed this. It was a big change. I hope I haven't broken anything!

        Doug Cutting added a comment -

        Here's a patch that replaces uses of java.io.File in Hadoop's FileSystem and MapReduce APIs with a new class named Path. I left some existing File-based methods, now deprecated, sufficient for Nutch to run without alteration. I'd like to remove the deprecated methods after the 0.2 release.

        I believe that the only incompatible change is that dfs.data.dir and mapred.local.dir, when lists of directories, must now be comma-separated and may no longer be space-separated. This is in order to make things work better on Windows.
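
A minimal sketch of what comma-separated parsing could look like follows. It is not the code from the patch; the class, the method and the example directories are hypothetical. It shows that a directory name containing a space, which is common on Windows, survives splitting on commas.

```java
import java.util.ArrayList;
import java.util.List;

final class DirListDemo {
    // Splits a comma-separated directory list, trimming whitespace
    // around each entry and skipping empty entries.
    static List<String> splitDirs(String value) {
        List<String> dirs = new ArrayList<>();
        for (String part : value.split(",")) {
            String dir = part.trim();
            if (!dir.isEmpty()) {
                dirs.add(dir);
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Hypothetical value for a setting such as dfs.data.dir.
        // The first directory contains a space and stays in one piece.
        System.out.println(splitDirs("C:\\Program Files\\hadoop\\data1, D:\\data2"));
    }
}
```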

        I have tested this in standalone and pseudo-distributed operation on both Linux and Windows, with unit tests and with the Nutch crawler.

        Barring objections, I will apply this tomorrow.

        Doug Cutting added a comment -

        > Does it make sense to create class that would extend File and override unsupported operations to throw UnsupportedOperationException?

        I'm not sure what advantages that would have. Is the idea to detect errors at runtime rather than at compile time? I've just about finished a patch adding a new class. I'll post it later today.

        Igor Bolotin added a comment -

        Does it make sense to create class that would extend File and override unsupported operations to throw UnsupportedOperationException?

        eric baldeschwieler added a comment -

        It could contain a URI...

        Doug Cutting added a comment -

        Working through this more, I'm now leaning away from URI and towards a new class. It will be easier to replace with a new class, since the API can be made to resemble File. For example, we have a lot of code that calls 'new File(dir, name)' to construct a file in a subdirectory. The idiom for doing that with URIs is slightly more complicated, and would require a utility method somewhere. Similarly for file.getParentFile(), etc.

        So now I'm leaning towards a class named "Path" that's mostly a drop-in replacement for File, except it doesn't support FS operations like exists(), mkdir(), delete(), etc.
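
The idea of a File-like name class without file system operations can be sketched roughly as follows. This is an illustrative guess at the shape of such a class, not the Path class from the patch; the name SketchPath and all of its methods are hypothetical.

```java
final class SketchPath {
    // Always '/'-separated, regardless of the platform the JVM runs on.
    private final String path;

    SketchPath(String path) {
        // Drop trailing slashes, except for the root itself.
        String p = path;
        while (p.length() > 1 && p.endsWith("/")) {
            p = p.substring(0, p.length() - 1);
        }
        this.path = p;
    }

    // Mirrors new File(parent, child).
    SketchPath(SketchPath parent, String child) {
        this(parent.path.endsWith("/") ? parent.path + child : parent.path + "/" + child);
    }

    // Mirrors File.getParentFile(); returns null for the root
    // and for a relative name with no directory part.
    SketchPath getParent() {
        int slash = path.lastIndexOf('/');
        if (slash < 0 || path.equals("/")) {
            return null;
        }
        return new SketchPath(slash == 0 ? "/" : path.substring(0, slash));
    }

    String getName() {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    boolean isAbsolute() {
        return path.startsWith("/");
    }

    // Deliberately no exists(), mkdir() or delete(): those would belong
    // to a FileSystem, which takes a name as an argument.

    @Override
    public String toString() {
        return path;
    }
}
```

With this sketch, `new SketchPath(new SketchPath("/foo/bar"), "baz")` names `/foo/bar/baz` on every platform, and any attempt to call exists() on it fails at compile time rather than at runtime.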

        Doug Cutting added a comment -

        URI actually can compute parent directory. For example:

        URI subDir = new URI("/foo/bar/baz/");
        URI parent = subDir.resolve("..");

        parent.toString() returns "/foo/bar/".

        So I think that URI has the features we want for filenames and not much else. Am I missing something?
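
For reference, here is a self-contained sketch of tree operations built on java.net.URI. The helper names are hypothetical, and the child helper assumes plain path URIs with no query or fragment and a name that needs no escaping. Note that the trailing slash matters: without it, resolve() treats the last segment as a file and replaces it.

```java
import java.net.URI;

final class UriTreeOps {
    // Parent of a directory-style URI, i.e. one whose path ends with '/'.
    static URI parent(URI dir) {
        return dir.resolve("..");
    }

    // Rough equivalent of new File(dir, name). A trailing slash is added
    // first so that resolve() appends the name instead of replacing the
    // last path segment.
    static URI child(URI dir, String name) {
        String s = dir.toString();
        URI base = s.endsWith("/") ? dir : URI.create(s + "/");
        return base.resolve(name);
    }

    public static void main(String[] args) {
        System.out.println(parent(URI.create("/foo/bar/baz/"))); // prints /foo/bar/
        System.out.println(parent(URI.create("/foo/bar/baz")));  // prints /foo/
        System.out.println(child(URI.create("/foo/bar"), "baz")); // prints /foo/bar/baz
    }
}
```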

        It might also be useful to implement a URLStreamHandler, so that one can create "hdfs:" URLs and use them wherever Java accepts URLs, e.g., in classloaders, etc. But the URL class doesn't support relative path name resolution, the primary feature we require for names.

        Unless there are objections, I'll start exploring replacing the uses of java.io.File with java.net.URI.

        My thinking is that we remove rather than deprecate the old methods. This makes the change incompatible, but I think we really want to get rid of the use of java.io.File. I'm willing to update Nutch & unit tests as required, but this may break others' code. Should we instead deprecate these in Hadoop 0.2 and then remove them in 0.3? Thoughts?

        Doug Cutting added a comment -

        > I think we should change this to a Hadoop-specific class, e.g. FileName.

        Why not URI? What required methods are missing from URI? Conversely, what URI methods do you think might cause problems?

        Partially answering my own question, with URIs we'd have to check that the scheme, host and port matched the fs when implementing each FS method. In other words, given that we need a FileSystem instance to do anything, the scheme, host and port fields of the URI are usually redundant and force us to perform error checking. However these same fields would be useful when specifying MapReduce input and output directories, in command lines, etc., permitting one to easily specify non-default FileSystem implementations.
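
The per-method check described above might look roughly like this. It is only a sketch: SketchFileSystem, checkName and the example host are hypothetical, and the comparison is deliberately simple (for instance, it compares host names exactly).

```java
import java.net.URI;
import java.util.Objects;

final class SketchFileSystem {
    // Identifies this file system, e.g. hdfs://namenode.example.com:9000
    private final URI root;

    SketchFileSystem(URI root) {
        this.root = root;
    }

    // Returns the path part of the name if it belongs to this file system,
    // otherwise throws. Every FS method taking a URI would need this call.
    String checkName(URI name) {
        // A name without a scheme is taken to refer to this file system.
        if (name.getScheme() == null) {
            return name.getPath();
        }
        boolean same = name.getScheme().equalsIgnoreCase(root.getScheme())
            && Objects.equals(name.getHost(), root.getHost())
            && name.getPort() == root.getPort();
        if (!same) {
            throw new IllegalArgumentException(
                "Wrong file system: " + name + ", expected " + root);
        }
        return name.getPath();
    }
}
```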

        Note that I don't think URI buys us interoperability with other systems. So we should only use it if we think it will make writing Hadoop easier: if it consists of code that we'd mostly need to write anyway.

        A side-benefit of URI is that it provides standards-defined filename syntax. We don't have to figure out how to, e.g., escape things, or how backslashes and colons should be treated, etc. We can simply point to a standard.

        > I also propose that this class should be versioned, and contain some File-like metadata - for now I'm thinking specifically about creation / modification time.

        This works so long as files are write-once. But if they can be appended to or overwritten then this information could get stale.

        Andrzej Bialecki added a comment -

        I think we should change this to a Hadoop-specific class, e.g. FileName (not a simple String - too limiting). FileName-s could only be used when holding a reference to a valid instance of FileSystem - this way operations like getParent() could always consult FileSystem-specific routines to resolve DFS names to real names in case of LocalFileSystem.

        I also propose that this class should be versioned, and contain some File-like metadata - for now I'm thinking specifically about creation / modification time.


          People

          • Assignee:
            Doug Cutting
            Reporter:
            Doug Cutting