Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-7099

Make RawLocalFileSystem more friendly to sub-classing

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 0.23.0
    • None
    • fs
    • None
    • RawLocalFileSystem FileSystem

    Description

      This patch does 3 things that makes sub-classing RawLocalFileSystem easier.

      First, it adds a constructor that allows a sub-class to avoid calling getInitialWorkingDirectory(). This is important because if a sub-class has an initially null uri (prior to initialize() being called), then getInitialWorkingDirectory() will cause a NullPointerException when it tries to work with the FS's uri.

      Second, allows subclasses to modify the working directory.

      The third thing this patch does is change loadPermissions to not pass the URI object to the File(URI) constructor, but rather pass the string representation of the path. This is important because URI's that are not using the "file" scheme will cause the File(URI) constructor to throw an exception.

      Attachments

        1. HADOOP-7099.patch
          1 kB
          Noah Watkins

        Issue Links

          Activity

            nwatkins Noah Watkins added a comment -

            Please review.

            nwatkins Noah Watkins added a comment - Please review.
            nwatkins Noah Watkins added a comment -

            Please review.

            nwatkins Noah Watkins added a comment - Please review.
            tlipcon Todd Lipcon added a comment -

            I haven't looked at your patch, but in general I think subclassing is a very fragile mechanism for providing new filesystems, so not sure if we should be making changes specifically for subclassers. We fell into this trap for a year or two in HBase development and it only came back to bite us hard later. FilterFileSystem is a much more maintainable mechanism. Does it fit the bill for your use case?

            tlipcon Todd Lipcon added a comment - I haven't looked at your patch, but in general I think subclassing is a very fragile mechanism for providing new filesystems, so not sure if we should be making changes specifically for subclassers. We fell into this trap for a year or two in HBase development and it only came back to bite us hard later. FilterFileSystem is a much more maintainable mechanism. Does it fit the bill for your use case?
            nwatkins Noah Watkins added a comment -

            Thanks for the feedback. I think that FilterFileSystem will work; I'll do some testing with it to see. As an alternative, would you be opposed to a patch that started from a duplicate copy of RawLocalFileSystem? In a lot of ways that might be the best solution, especially thinking forward to cases in which the two diverge. But, I want to check if there is any major issue with that code duplication?

            nwatkins Noah Watkins added a comment - Thanks for the feedback. I think that FilterFileSystem will work; I'll do some testing with it to see. As an alternative, would you be opposed to a patch that started from a duplicate copy of RawLocalFileSystem? In a lot of ways that might be the best solution, especially thinking forward to cases in which the two diverge. But, I want to check if there is any major issue with that code duplication?
            tlipcon Todd Lipcon added a comment -

            I wouldn't want to duplicate the class inside Hadoop - the goal is to keep the core small where possible, I think, especially where we're talking about new interfaces we'll have to support.

            Down the road we already have a new parallel set of classes for FS access on the way - FileContext and the "new" FileSystem. Worth checking out if you haven't seen them yet.

            tlipcon Todd Lipcon added a comment - I wouldn't want to duplicate the class inside Hadoop - the goal is to keep the core small where possible, I think, especially where we're talking about new interfaces we'll have to support. Down the road we already have a new parallel set of classes for FS access on the way - FileContext and the "new" FileSystem. Worth checking out if you haven't seen them yet.
            nwatkins Noah Watkins added a comment -

            Yes, I've seen the new FS interfaces, but I'm waiting for a new implementation of RawLocal that uses it (I believe the current version uses DelegateToFileSystem?). This all sounds good. I'll switch over to FilterFileSystem for now. Thanks!

            nwatkins Noah Watkins added a comment - Yes, I've seen the new FS interfaces, but I'm waiting for a new implementation of RawLocal that uses it (I believe the current version uses DelegateToFileSystem?). This all sounds good. I'll switch over to FilterFileSystem for now. Thanks!
            tlipcon Todd Lipcon added a comment -

            Cool, shall we resolve this one as wontfix then?

            tlipcon Todd Lipcon added a comment - Cool, shall we resolve this one as wontfix then?
            nwatkins Noah Watkins added a comment -

            Yes, sounds good. There is a minor change to RawLocal that will be needed, but I'lll start a new JIRA for that. It's unrelated to this JIRA. Thanks a lot!

            nwatkins Noah Watkins added a comment - Yes, sounds good. There is a minor change to RawLocal that will be needed, but I'lll start a new JIRA for that. It's unrelated to this JIRA. Thanks a lot!
            nwatkins Noah Watkins added a comment -

            Well, I'm afraid that using the FilterFileSystem does pose a little bit of a road block. It has to do with checkPath failing when the FilteredFileSystem changes its scheme. So, for example, if a FilteredFileSystem uses RawLocalFileSystem as the base FS, but changes its scheme to be "new-scheme", then RawLocalFileSystem#checkPath fails because it expects "file" scheme.

            In this case is it better to design a new checkPath that lets "new-scheme" through, or is this indicative of a larger problem?

            nwatkins Noah Watkins added a comment - Well, I'm afraid that using the FilterFileSystem does pose a little bit of a road block. It has to do with checkPath failing when the FilteredFileSystem changes its scheme. So, for example, if a FilteredFileSystem uses RawLocalFileSystem as the base FS, but changes its scheme to be "new-scheme", then RawLocalFileSystem#checkPath fails because it expects "file" scheme. In this case is it better to design a new checkPath that lets "new-scheme" through, or is this indicative of a larger problem?
            tlipcon Todd Lipcon added a comment -

            Hmm, that's a good point. Owen, I noticed you're watching this ticket, do you have any thoughts here? Noah, maybe you can elaborate a little on the use case? Is this for Ceph?

            tlipcon Todd Lipcon added a comment - Hmm, that's a good point. Owen, I noticed you're watching this ticket, do you have any thoughts here? Noah, maybe you can elaborate a little on the use case? Is this for Ceph?
            nwatkins Noah Watkins added a comment -

            Yes, this is for Ceph. The solution we are using is great: code is identical to RawLocalFileSystem, but for files located under a Ceph mount point, getBlockLocations consults the in-kernel client via an IOCTL call written in JNI. That's the high-level view. In practice it has been a little hard to get right in terms of something that fits well into the Hadoop code base.

            So, the fundamental issue is that access to Ceph is through the local file system (i.e. file:///), but for a subset of the file system hierarchy (i.e. the Ceph mount point), a different file system implementation should be used, not just for accounting, but also for altered functionality (i.e. the ioctl availability for Ceph files). We use a configuration parameter to specify the mount point, and force access through the Hadoop Ceph File System to stay within the mount point.

            With sub-classing, the solution is nice because changing the URI (ceph:///) doesn't cause the problem I cited in the previous comment (checkPath blowing up).

            Does that paint a clearer picture? We now have our code working well, but are struggling with how to package it up for distribution.

            Thanks!

            nwatkins Noah Watkins added a comment - Yes, this is for Ceph. The solution we are using is great: code is identical to RawLocalFileSystem, but for files located under a Ceph mount point, getBlockLocations consults the in-kernel client via an IOCTL call written in JNI. That's the high-level view. In practice it has been a little hard to get right in terms of something that fits well into the Hadoop code base. So, the fundamental issue is that access to Ceph is through the local file system (i.e. file:/// ), but for a subset of the file system hierarchy (i.e. the Ceph mount point), a different file system implementation should be used, not just for accounting, but also for altered functionality (i.e. the ioctl availability for Ceph files). We use a configuration parameter to specify the mount point, and force access through the Hadoop Ceph File System to stay within the mount point. With sub-classing, the solution is nice because changing the URI (ceph:///) doesn't cause the problem I cited in the previous comment (checkPath blowing up). Does that paint a clearer picture? We now have our code working well, but are struggling with how to package it up for distribution. Thanks!
            nwatkins Noah Watkins added a comment -

            Explained more succinctly:

            We want to reuse the RawLocalFileSystem code (plus some additions, and new URI scheme):

            • Sub-classing seems to have caused problems in the past
            • Delegation with FIlterFileSystem poses issues with checkPath
            • And duplicating the code just seems inefficient.

            Thanks for the patience in explaining this well.

            nwatkins Noah Watkins added a comment - Explained more succinctly: We want to reuse the RawLocalFileSystem code (plus some additions, and new URI scheme): Sub-classing seems to have caused problems in the past Delegation with FIlterFileSystem poses issues with checkPath And duplicating the code just seems inefficient. Thanks for the patience in explaining this well.
            omalley Owen O'Malley added a comment -

            Couldn't you just override the checkPath implementation in FilterFileSystem? It defaults to passing to the lower level, but clearly you don't want it to.

            omalley Owen O'Malley added a comment - Couldn't you just override the checkPath implementation in FilterFileSystem? It defaults to passing to the lower level, but clearly you don't want it to.
            nwatkins Noah Watkins added a comment -

            It doesn't seem to be FilterFileSystem#checkPath that is the problem, but rather RawLocalFileSystem#checkPath.

            Using Ceph as an example, in which CephFileSystem extends FilterFileSystem:

            • If a path exists in CephFileSystem, it will have scheme "ceph"
            • When "delete()" is called the path is passed down to RawLocalFileSystem from CephFileSystem.
            • However, once in RawLocalFileSystem, any call to checkPath will expect the "file" scheme, and blow up.

            This problem doesn't occur in the current setup because classes using FilterFileSystem to delegate to a lower FS use the same scheme.

            One way to solve this in FilterFileSystem might be to transform every path that passes through to be of the "file" scheme that RawLocal expects. However, this seems likely quite fragile. Thoughts?

            Thanks

            nwatkins Noah Watkins added a comment - It doesn't seem to be FilterFileSystem#checkPath that is the problem, but rather RawLocalFileSystem#checkPath. Using Ceph as an example, in which CephFileSystem extends FilterFileSystem: If a path exists in CephFileSystem, it will have scheme "ceph" When "delete()" is called the path is passed down to RawLocalFileSystem from CephFileSystem. However, once in RawLocalFileSystem, any call to checkPath will expect the "file" scheme, and blow up. This problem doesn't occur in the current setup because classes using FilterFileSystem to delegate to a lower FS use the same scheme. One way to solve this in FilterFileSystem might be to transform every path that passes through to be of the "file" scheme that RawLocal expects. However, this seems likely quite fragile. Thoughts? Thanks
            omalley Owen O'Malley added a comment -

            Ah, of course. You'll need to re-write the path to have a file: schema before you delegate it down. In the long term, it probably makes sense to have support for that in FilterFileSystem. Notably:

            protected Path modifyPathForInnerFileSystem(Path path) throws IOException {
              return path;
            }
            

            And then in FilterFileSystem all of the default bodies should call that method to transform the path before passing the path to the nested file system.

            omalley Owen O'Malley added a comment - Ah, of course. You'll need to re-write the path to have a file: schema before you delegate it down. In the long term, it probably makes sense to have support for that in FilterFileSystem. Notably: protected Path modifyPathForInnerFileSystem(Path path) throws IOException { return path; } And then in FilterFileSystem all of the default bodies should call that method to transform the path before passing the path to the nested file system.
            nwatkins Noah Watkins added a comment -

            Owen, this all sounds great. I'll move forward then with transforming my inheritance version to use delegation.

            Do you have any thoughts on possible points in the abstraction that might "leak" paths that contain the nested file system's URI scheme? Should there also be a hook in FilterFileSystem that will change the scheme of any path returned from the nested class?

            nwatkins Noah Watkins added a comment - Owen, this all sounds great. I'll move forward then with transforming my inheritance version to use delegation. Do you have any thoughts on possible points in the abstraction that might "leak" paths that contain the nested file system's URI scheme? Should there also be a hook in FilterFileSystem that will change the scheme of any path returned from the nested class?
            omalley Owen O'Malley added a comment -

            Actually, this will help fix another annoyance where the RawLocal and Local FileSystem both use "file:" as their protocol. We probably should use "file:" for raw and "filecrc:" for checksummed.

            omalley Owen O'Malley added a comment - Actually, this will help fix another annoyance where the RawLocal and Local FileSystem both use "file:" as their protocol. We probably should use "file:" for raw and "filecrc:" for checksummed.
            nwatkins Noah Watkins added a comment -

            Hi Owen. Would you like to see a patch to FilterFileSystem now, or should I instead perform these path conversions in my own file system class?

            nwatkins Noah Watkins added a comment - Hi Owen. Would you like to see a patch to FilterFileSystem now, or should I instead perform these path conversions in my own file system class?

            People

              Unassigned Unassigned
              nwatkins Noah Watkins
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated: