Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-4949 Centralized cache management in HDFS
  3. HDFS-5224

Refactor PathBasedCache* methods to use a Path rather than a String

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • HDFS-4949
    • HDFS-4949
    • datanode, namenode
    • None
    • Reviewed

    Description

      As discussed in HDFS-5213, we should refactor PathBasedCacheDirective and related methods in DistributedFileSystem to use a Path to represent paths to cache, rather than a String.

      Attachments

        1. HDFS-5224.3.patch
          34 kB
          Chris Nauroth
        2. HDFS-5224.2.patch
          30 kB
          Chris Nauroth
        3. HDFS-5224.1.patch
          28 kB
          Chris Nauroth

        Activity

          cnauroth Chris Nauroth added a comment -

          Here is an initial cut at a patch. I don't think it's final, but I want to discuss further with andrew.wang and cmccabe before doing any more work.

          This changes the client-facing API in DistributedFileSystem so that any occurrence of a path is represented by a Path object instead of a String. This includes the internal member of PathBasedCacheDirective and PathBasedCacheDescriptor. Input and output Path objects are subject to validation and qualification against the file system and its working directory, just like other methods that use a Path. EmptyPathError has been removed, because it is impossible to create a Path with an empty string.

          From this version of the patch, it looks like Path is useful in the client-facing API of DistributedFileSystem, for the reasons discussed in earlier issues: existing validation logic, path qualification, and consistency with other methods. However, it also looks like Path is more of a hindrance after you cross the barrier from DistributedFileSystem into the protocol and the namenode implementation. The path string portion of the Path is the only useful thing at that point, so we just end up passing it around a lot until we need to unpack it via Path#toUri#getPath.

          I'd like to float the idea of defining different objects, similar to PathBasedCacheDirective/PathBasedCacheDescriptor, but using strings for the paths. The existing objects would be used in the interface of DistributedFileSystem. The new objects would be used at all layers below: DFSClient, ClientProtocol, CacheManager, etc. This would decouple client and server so that the two sides can evolve independently. There is some existing precedent for this in that DFSClient and ClientProtocol use String instead of Path.

          Colin and Andrew, thoughts?

          cnauroth Chris Nauroth added a comment - Here is an initial cut at a patch. I don't think it's final, but I want to discuss further with andrew.wang and cmccabe before doing any more work. This changes the client-facing API in DistributedFileSystem so that any occurrence of a path is represented by a Path object instead of a String . This includes the internal member of PathBasedCacheDirective and PathBasedCacheDescriptor . Input and output Path objects are subject to validation and qualification against the file system and its working directory, just like other methods that use a Path . EmptyPathError has been removed, because it is impossible to create a Path with an empty string. From this version of the patch, it looks like Path is useful in the client-facing API of DistributedFileSystem , for the reasons discussed in earlier issues: existing validation logic, path qualification, and consistency with other methods. However, it also looks like Path is more of a hindrance after you cross the barrier from DistributedFileSystem into the protocol and the namenode implementation. The path string portion of the Path is the only useful thing at that point, so we just end up passing it around a lot until we need to unpack it via Path#toUri#getPath . I'd like to float the idea of defining different objects, similar to PathBasedCacheDirective / PathBasedCacheDescriptor , but using strings for the paths. The existing objects would be used in the interface of DistributedFileSystem . The new objects would be used at all layers below: DFSClient , ClientProtocol , CacheManager , etc. This would decouple client and server so that the two sides can evolve independently. There is some existing precedent for this in that DFSClient and ClientProtocol use String instead of Path . Colin and Andrew, thoughts?
          andrew.wang Andrew Wang added a comment -

          Hey Chris, thanks for picking this up,

          I agree with your thinking here. We shouldn't be using a Path to the server-side, since we're supposed to handle Pathy-y things like the scheme and relative paths on the client. The NN should only be seeing absolute path strings.

          Adding some new classes dealing in strings is likely the right call, though it does unfortunately increase the class-bloat. It's already kind of complicated having the similar-ish Directives, Descriptors, and Entries, and now we'll have 2 more (though both internal to HDFS). Internal complexity we can deal with, but simplifying all of this is an end-goal of mine.

          andrew.wang Andrew Wang added a comment - Hey Chris, thanks for picking this up, I agree with your thinking here. We shouldn't be using a Path to the server-side, since we're supposed to handle Pathy -y things like the scheme and relative paths on the client. The NN should only be seeing absolute path strings. Adding some new classes dealing in strings is likely the right call, though it does unfortunately increase the class-bloat. It's already kind of complicated having the similar-ish Directives , Descriptors , and Entries , and now we'll have 2 more (though both internal to HDFS). Internal complexity we can deal with, but simplifying all of this is an end-goal of mine.
          cmccabe Colin McCabe added a comment -

          Thanks for looking at this, Chris.

          I don't think we should create another class. We already have PathBasedCacheDirective (client-facing), PathBasedCacheDescriptor (a directive with an ID associated), and PathBasedCacheEntry (server-side). Adding more classes would just be too many!

          It would make more sense to add a method like DistributedFileSystem#createPathBasedCacheDirective(Path path) which returns a PathBasedCacheDirective. Keep in mind that we want the DistributedFileSystem to supply its current working directory, and validate that the path matches its URI. Then, we should make the fields in PathBasedCacheDirective mutable and provide setters, so clients can do something like this:

          fs.addPathBasedCacheDirective(fs.createPathBasedCacheDirective(new Path("/foo/bar")).
              setReplication(2).
              setPool("pool1")));
          

          This way:

          • we don't have more class bloat
          • DFS provides its working directory and validates the path
          • it's easy to see what's being set
          • we can add more fields to the directive later without breaking compatibility or getting into constructor overload hell
          cmccabe Colin McCabe added a comment - Thanks for looking at this, Chris. I don't think we should create another class. We already have PathBasedCacheDirective (client-facing), PathBasedCacheDescriptor (a directive with an ID associated), and PathBasedCacheEntry (server-side). Adding more classes would just be too many! It would make more sense to add a method like DistributedFileSystem#createPathBasedCacheDirective(Path path) which returns a PathBasedCacheDirective . Keep in mind that we want the DistributedFileSystem to supply its current working directory, and validate that the path matches its URI. Then, we should make the fields in PathBasedCacheDirective mutable and provide setters, so clients can do something like this: fs.addPathBasedCacheDirective(fs.createPathBasedCacheDirective( new Path( "/foo/bar" )). setReplication(2). setPool( "pool1" ))); This way: we don't have more class bloat DFS provides its working directory and validates the path it's easy to see what's being set we can add more fields to the directive later without breaking compatibility or getting into constructor overload hell
          andrew.wang Andrew Wang added a comment -

          Good idea, only change I'd like is to keep the directive immutable and use a builder instead. We create the descriptor from the input directive in the ClientNamenode PB translator, so it could possibly race here, and I don't think mutating a directive after create time makes much sense anyway (esp after it's used in addPathBasedCacheDirective).

          andrew.wang Andrew Wang added a comment - Good idea, only change I'd like is to keep the directive immutable and use a builder instead. We create the descriptor from the input directive in the ClientNamenode PB translator, so it could possibly race here, and I don't think mutating a directive after create time makes much sense anyway (esp after it's used in addPathBasedCacheDirective ).
          cnauroth Chris Nauroth added a comment -

          cmccabe, if I understand correctly, you're advocating that we go ahead and pass around PathBasedCacheDirective/Descriptor at all layers (client, protocol, and NN implementation all the way down to CacheManager). The path member will be a Path, not a String, and the implementation in the NN will only use the path part and ignore all other URI components like scheme and authority. This is accomplished by calling getPath().toUri().getPath() in any NN code that needs to work with it. Am I understanding the proposal correctly? If so, then the consequence is that we have a slightly odd data type in the NN implementation (most components of Path will be ignored), but the benefit is that we have fewer total classes.

          (BTW, I'm not objecting to anything above, just trying to understand trade-offs.)

          Keep in mind that we want the DistributedFileSystem to supply its current working directory, and validate that the path matches its URI.

          Yes, understood. The current patch already does this, and now I think we're just working out the final state for refactoring this code.

          cnauroth Chris Nauroth added a comment - cmccabe , if I understand correctly, you're advocating that we go ahead and pass around PathBasedCacheDirective/Descriptor at all layers (client, protocol, and NN implementation all the way down to CacheManager ). The path member will be a Path , not a String , and the implementation in the NN will only use the path part and ignore all other URI components like scheme and authority. This is accomplished by calling getPath().toUri().getPath() in any NN code that needs to work with it. Am I understanding the proposal correctly? If so, then the consequence is that we have a slightly odd data type in the NN implementation (most components of Path will be ignored), but the benefit is that we have fewer total classes. (BTW, I'm not objecting to anything above, just trying to understand trade-offs.) Keep in mind that we want the DistributedFileSystem to supply its current working directory, and validate that the path matches its URI. Yes, understood. The current patch already does this, and now I think we're just working out the final state for refactoring this code.
          cmccabe Colin McCabe added a comment -

          Colin Patrick McCabe, if I understand correctly, you're advocating that we go ahead and pass around PathBasedCacheDirective/Descriptor at all layers (client, protocol, and NN implementation all the way down to CacheManager). The path member will be a Path, not a String, and the implementation in the NN will only use the path part and ignore all other URI components like scheme and authority.

          Right, exactly. If you look at NameNodeRpcServer, you see that all the methods there treat path as a String. The NN doesn't need scheme or authority, since we assume that if you figured out how to talk to the NN, you know those already Only the DistributedFilesystem (client-facing API) needs to handle scheme and authority.

          Good idea, only change I'd like is to keep the directive immutable and use a builder instead. We create the descriptor from the input directive in the ClientNamenode PB translator, so it could possibly race here, and I don't think mutating a directive after create time makes much sense anyway (esp after it's used in addPathBasedCacheDirective).

          Good idea. We could have a method in the DistributedFileSystem return a PathBasedCacheDirective#Builder, perhaps.

          cmccabe Colin McCabe added a comment - Colin Patrick McCabe, if I understand correctly, you're advocating that we go ahead and pass around PathBasedCacheDirective/Descriptor at all layers (client, protocol, and NN implementation all the way down to CacheManager). The path member will be a Path, not a String, and the implementation in the NN will only use the path part and ignore all other URI components like scheme and authority. Right, exactly. If you look at NameNodeRpcServer , you see that all the methods there treat path as a String . The NN doesn't need scheme or authority, since we assume that if you figured out how to talk to the NN, you know those already Only the DistributedFilesystem (client-facing API) needs to handle scheme and authority. Good idea, only change I'd like is to keep the directive immutable and use a builder instead. We create the descriptor from the input directive in the ClientNamenode PB translator, so it could possibly race here, and I don't think mutating a directive after create time makes much sense anyway (esp after it's used in addPathBasedCacheDirective). Good idea. We could have a method in the DistributedFileSystem return a PathBasedCacheDirective#Builder , perhaps.
          cmccabe Colin McCabe added a comment -

          er, wait, I think I misunderstood Chris's proposal.

          No, I am not saying we add Path to PathBasedCacheEntry. I'm saying we add a method in DistributedFileSystem that takes a Path and returns a PathBasedCacheEntry initialized with the correct String (or a builder initialized with the correct string)

          cmccabe Colin McCabe added a comment - er, wait, I think I misunderstood Chris's proposal. No, I am not saying we add Path to PathBasedCacheEntry . I'm saying we add a method in DistributedFileSystem that takes a Path and returns a PathBasedCacheEntry initialized with the correct String (or a builder initialized with the correct string)
          cnauroth Chris Nauroth added a comment -

          I'm saying we add a method in DistributedFileSystem that takes a Path and returns a PathBasedCacheEntry initialized with the correct String (or a builder initialized with the correct string).

          I had steered away from this approach, because it leaves clients getting a String instead of a Path when they call listPathBasedCacheDescriptors. That would still be a bit of inconsistency in the interface. (By analogy, listStatus returns FileStatus, which contains Path.)

          While folks chew on this a little more, I'll get started on the builder code itself. It sounds like everyone is in agreement on that part.

          cnauroth Chris Nauroth added a comment - I'm saying we add a method in DistributedFileSystem that takes a Path and returns a PathBasedCacheEntry initialized with the correct String (or a builder initialized with the correct string). I had steered away from this approach, because it leaves clients getting a String instead of a Path when they call listPathBasedCacheDescriptors . That would still be a bit of inconsistency in the interface. (By analogy, listStatus returns FileStatus , which contains Path .) While folks chew on this a little more, I'll get started on the builder code itself. It sounds like everyone is in agreement on that part.
          andrew.wang Andrew Wang added a comment -

          Is it acceptable to have the descriptor's internal representation still be a String, but then turn it into a Path for the client in the getter? I think this is ok since the path string is always absolute, and we could even qualify with the correct scheme and authority if desired.

          andrew.wang Andrew Wang added a comment - Is it acceptable to have the descriptor's internal representation still be a String , but then turn it into a Path for the client in the getter? I think this is ok since the path string is always absolute, and we could even qualify with the correct scheme and authority if desired.
          cmccabe Colin McCabe added a comment -

          I don't have a super-strong opinion about this, but I like Andrew's proposal to have a getter that returns a Path in the PathBasedCacheDescriptor.

          While folks chew on this a little more, I'll get started on the builder code itself. It sounds like everyone is in agreement on that part.

          Yeah, a factory would be good.

          cmccabe Colin McCabe added a comment - I don't have a super-strong opinion about this, but I like Andrew's proposal to have a getter that returns a Path in the PathBasedCacheDescriptor . While folks chew on this a little more, I'll get started on the builder code itself. It sounds like everyone is in agreement on that part. Yeah, a factory would be good.
          cnauroth Chris Nauroth added a comment -

          Here is version 2 of the patch.

          Is it acceptable to have the descriptor's internal representation still be a String, but then turn it into a Path for the client in the getter?

          I wound up not doing this, because it didn't seem to simplify anything in practice. On the client side, it causes an extra trip through the Path constructor. On the server side, code like CacheManager still ends up needing to turn the Path back into a String.

          The fundamental problem here is that the client wants a URI, and the NN wants just a path. If we don't want to define different data types to represent that difference, then we need to overload the meaning of something. Putting a Path in the object and unpacking in a few places in the NN seems like the least intrusive way to do that, so I'm OK with this. Let me know what you think.

          cnauroth Chris Nauroth added a comment - Here is version 2 of the patch. Is it acceptable to have the descriptor's internal representation still be a String, but then turn it into a Path for the client in the getter? I wound up not doing this, because it didn't seem to simplify anything in practice. On the client side, it causes an extra trip through the Path constructor. On the server side, code like CacheManager still ends up needing to turn the Path back into a String . The fundamental problem here is that the client wants a URI, and the NN wants just a path. If we don't want to define different data types to represent that difference, then we need to overload the meaning of something. Putting a Path in the object and unpacking in a few places in the NN seems like the least intrusive way to do that, so I'm OK with this. Let me know what you think.
          andrew.wang Andrew Wang added a comment -

          Hey Chris,

          Sorry about the delay on getting back to you. I played with your patch a bit and eventually arrived at the same conclusion as you. Having {{Path}}s on the namenode is unfortunate, but it's a lesser of two evils situation.

          Some review comments:

          • I think we should make it so the PCD path is qualified at creation time, else you might create a PCD with a relative path, change working directories, and the PCD now refers to a different path. I think the easiest way of doing this is a factory method in DFS.
          • CacheAdmin#run: we already check that path is not null, so I don't think we need that ternary. The PCDir constructor already does a null check.
          • We could do a little cleanup of validation in PCDir too, while we're there. I think we should move all the checks to PCD#validate, and call it in Builder#build. It'd also be nice to do a little cleanup to have all the checks in one place, e.g. #validate.
          • I think we still want EmptyPathError since someone could manually mangle the protobuf to give us an empty path. It's nicer to throw the named exception than some Path creation error in the PB translator.

          Overall looks really good though, thanks for working on this.

          andrew.wang Andrew Wang added a comment - Hey Chris, Sorry about the delay on getting back to you. I played with your patch a bit and eventually arrived at the same conclusion as you. Having {{Path}}s on the namenode is unfortunate, but it's a lesser of two evils situation. Some review comments: I think we should make it so the PCD path is qualified at creation time, else you might create a PCD with a relative path, change working directories, and the PCD now refers to a different path. I think the easiest way of doing this is a factory method in DFS . CacheAdmin#run: we already check that path is not null, so I don't think we need that ternary. The PCDir constructor already does a null check. We could do a little cleanup of validation in PCDir too, while we're there. I think we should move all the checks to PCD#validate , and call it in Builder#build . It'd also be nice to do a little cleanup to have all the checks in one place, e.g. #validate . I think we still want EmptyPathError since someone could manually mangle the protobuf to give us an empty path. It's nicer to throw the named exception than some Path creation error in the PB translator. Overall looks really good though, thanks for working on this.
          cnauroth Chris Nauroth added a comment -

          No worries on the delay, and thanks for the feedback.

          I think we should make it so the PCD path is qualified at creation time, else you might create a PCD with a relative path, change working directories, and the PCD now refers to a different path. I think the easiest way of doing this is a factory method in DFS.

          This is actually consistent with all of the other DFS methods that accept a Path. If you create a relative Path instance, then call DistributedFileSystem#setWorkingDirectory, and then pass the Path to a method, the method internally resolves against the new working directory, not the working directory that was in effect at Path creation time. Thus, adding a factory method like this on DFS would create an inconsistency in the API. If a caller doesn't like this behavior for their PathBasedCacheDirective instances, then they can create it using the result of DistributedFileSystem#makeQualified, passing the relative Path, and this would be consistent with the rest of the DFS API. Let me know what you think.

          cnauroth Chris Nauroth added a comment - No worries on the delay, and thanks for the feedback. I think we should make it so the PCD path is qualified at creation time, else you might create a PCD with a relative path, change working directories, and the PCD now refers to a different path. I think the easiest way of doing this is a factory method in DFS. This is actually consistent with all of the other DFS methods that accept a Path . If you create a relative Path instance, then call DistributedFileSystem#setWorkingDirectory , and then pass the Path to a method, the method internally resolves against the new working directory, not the working directory that was in effect at Path creation time. Thus, adding a factory method like this on DFS would create an inconsistency in the API. If a caller doesn't like this behavior for their PathBasedCacheDirective instances, then they can create it using the result of DistributedFileSystem#makeQualified , passing the relative Path , and this would be consistent with the rest of the DFS API. Let me know what you think.
          cnauroth Chris Nauroth added a comment -

          CacheAdmin#run: we already check that path is not null, so I don't think we need that ternary. The PCDir constructor already does a null check.

          Yes, we don't need the check for AddPathBasedCacheDirectiveCommand, because it pre-validates the path string from the command line. We do need the ternary for ListPathBasedCacheDirectiveCommand though, where a null path means "don't filter my result set on path". I'll clean this up for the add case.

          I'll look into the validation cleanup too, and add back EmptyPathError.

          cnauroth Chris Nauroth added a comment - CacheAdmin#run: we already check that path is not null, so I don't think we need that ternary. The PCDir constructor already does a null check. Yes, we don't need the check for AddPathBasedCacheDirectiveCommand , because it pre-validates the path string from the command line. We do need the ternary for ListPathBasedCacheDirectiveCommand though, where a null path means "don't filter my result set on path". I'll clean this up for the add case. I'll look into the validation cleanup too, and add back EmptyPathError .
          andrew.wang Andrew Wang added a comment -

          This is actually consistent with all of the other DFS methods that accept a Path.

          Alright, that sounds reasonable. Thanks Chris. Looking forward to the new patch.

          andrew.wang Andrew Wang added a comment - This is actually consistent with all of the other DFS methods that accept a Path. Alright, that sounds reasonable. Thanks Chris. Looking forward to the new patch.
          cnauroth Chris Nauroth added a comment -

          I'm attaching patch version 3. The differences since last time are:

          1. CacheAdmin: Removed redundant null check.
          2. EmptyPathError: This is back in a slightly different form. Since it's now impossible to instantiate a PathBasedCacheDirective with an empty path, the check is enforced at the protocol in the translator. I've added a new unit test at this layer too.

          I decided to leave validate alone. Andrew, I hit a snag when I tried calling it from build. The validation logic includes a call to DFSUtil#isValidName. That method checks that it's an absolute path. We want to enable callers to instantiate with a relative path and have it resolved by the DFS, so doing this validation at build time conflicts with that. The check for an absolute path is still helpful at the time that CacheManager is working on it, so I just left the whole thing alone.

          cnauroth Chris Nauroth added a comment - I'm attaching patch version 3. The differences since last time are: CacheAdmin : Removed redundant null check. EmptyPathError : This is back in a slightly different form. Since it's now impossible to instantiate a PathBasedCacheDirective with an empty path, the check is enforced at the protocol in the translator. I've added a new unit test at this layer too. I decided to leave validate alone. Andrew, I hit a snag when I tried calling it from build . The validation logic includes a call to DFSUtil#isValidName . That method checks that it's an absolute path. We want to enable callers to instantiate with a relative path and have it resolved by the DFS, so doing this validation at build time conflicts with that. The check for an absolute path is still helpful at the time that CacheManager is working on it, so I just left the whole thing alone.
          andrew.wang Andrew Wang added a comment -

          +1, thanks for the contribution Chris.

          andrew.wang Andrew Wang added a comment - +1, thanks for the contribution Chris.
          cnauroth Chris Nauroth added a comment -

          I've committed this to the HDFS-4949 branch. Thanks for the reviews!

          cnauroth Chris Nauroth added a comment - I've committed this to the HDFS-4949 branch. Thanks for the reviews!

          People

            cnauroth Chris Nauroth
            andrew.wang Andrew Wang
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: