Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      HDFS-245. Adds a symlink implementation to HDFS. This complements the new symlink feature added in HADOOP-6421.

      Description

      HDFS should support symbolic links. A symbolic link is a special type of file that contains a reference to another file or directory in the form of an absolute or relative path and that affects pathname resolution. Programs which read or write to files named by a symbolic link will behave as if operating directly on the target file. However, archiving utilities can handle symbolic links specially and manipulate them directly.
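
      As an aside for readers tracing this issue forward: the release note above points at the FileContext-level symlink API from HADOOP-6421. The sketch below shows roughly how a client would create and inspect a symlink through that interface; the paths are hypothetical and the snippet assumes a release containing both HADOOP-6421 and this issue.

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileContext;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.Path;

          public class SymlinkExample {
            public static void main(String[] args) throws Exception {
              FileContext fc = FileContext.getFileContext(new Configuration());

              Path target = new Path("/user/alice/data/part-00000"); // hypothetical target
              Path link = new Path("/user/alice/current");           // hypothetical link

              // Create the link; createParent=false means parent dirs must already exist.
              fc.createSymlink(target, link, false);

              // Stat the link itself (does not follow it) ...
              FileStatus linkStat = fc.getFileLinkStatus(link);
              System.out.println("is symlink: " + linkStat.isSymlink());

              // ... and read back what the link points to.
              System.out.println("target: " + fc.getLinkTarget(link));
            }
          }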

      Attachments

      1. 4044_20081030spi.java (6 kB, Tsz Wo Nicholas Sze)
      2. designdocv1.txt (7 kB, Eli Collins)
      3. designdocv2.txt (12 kB, Eli Collins)
      4. designdocv3.txt (13 kB, Eli Collins)
      5. design-doc-v4.txt (9 kB, Eli Collins)
      6. HADOOP-4044-strawman.patch (4 kB, Doug Cutting)
      7. symlink-0.20.0.patch (155 kB, weimin zhu)
      8. symLink1.patch (44 kB, dhruba borthakur)
      9. symLink1.patch (5 kB, dhruba borthakur)
      10. symLink11.patch (185 kB, dhruba borthakur)
      11. symLink12.patch (199 kB, dhruba borthakur)
      12. symLink13.patch (200 kB, dhruba borthakur)
      13. symLink14.patch (200 kB, dhruba borthakur)
      14. symLink15.txt (213 kB, dhruba borthakur)
      15. symLink15.txt (213 kB, dhruba borthakur)
      16. symlink16-common.patch (110 kB, Eli Collins)
      17. symlink16-hdfs.patch (138 kB, Eli Collins)
      18. symlink16-mr.patch (12 kB, Eli Collins)
      19. symlink17-common.txt (47 kB, dhruba borthakur)
      20. symlink17-hdfs.txt (71 kB, dhruba borthakur)
      21. symlink18-common.txt (24 kB, dhruba borthakur)
      22. symlink19-common.txt (25 kB, dhruba borthakur)
      23. symlink19-common.txt (24 kB, dhruba borthakur)
      24. symlink19-common-delta.patch (6 kB, Eli Collins)
      25. symlink19-hdfs.txt (66 kB, dhruba borthakur)
      26. symlink19-hdfs-delta.patch (45 kB, Eli Collins)
      27. symlink20-common.patch (31 kB, Eli Collins)
      28. symlink20-hdfs.patch (96 kB, Eli Collins)
      29. symlink21-common.patch (31 kB, Eli Collins)
      30. symlink21-hdfs.patch (96 kB, Eli Collins)
      31. symlink22-common.patch (31 kB, Eli Collins)
      32. symlink22-hdfs.patch (100 kB, Eli Collins)
      33. symlink23-common.patch (34 kB, Eli Collins)
      34. symlink23-hdfs.patch (101 kB, Eli Collins)
      35. symlink24-hdfs.patch (103 kB, Eli Collins)
      36. symlink-25-hdfs.patch (107 kB, Eli Collins)
      37. symlink-26-hdfs.patch (108 kB, Eli Collins)
      38. symlink-26-hdfs.patch (107 kB, Eli Collins)
      39. symlink27-hdfs.patch (113 kB, Eli Collins)
      40. symlink28-hdfs.patch (117 kB, Eli Collins)
      41. symlink29-hdfs.patch (117 kB, Eli Collins)
      42. symlink29-hdfs.patch (117 kB, Eli Collins)
      43. symlink30-hdfs.patch (117 kB, Eli Collins)
      44. symlink31-hdfs.patch (97 kB, Eli Collins)
      45. symlink33-hdfs.patch (268 kB, Eli Collins)
      46. symlink35-hdfs.patch (176 kB, Eli Collins)
      47. symlink36-hdfs.patch (178 kB, Eli Collins)
      48. symlink37-hdfs.patch (178 kB, Eli Collins)
      49. symlink38-hdfs.patch (181 kB, Eli Collins)
      50. symlink39-hdfs.patch (181 kB, Eli Collins)
      51. symLink4.patch (157 kB, dhruba borthakur)
      52. symlink40-hdfs.patch (180 kB, Eli Collins)
      53. symlink41-hdfs.patch (182 kB, Eli Collins)
      54. symLink5.patch (162 kB, dhruba borthakur)
      55. symLink6.patch (158 kB, dhruba borthakur)
      56. symLink8.patch (155 kB, dhruba borthakur)
      57. symLink9.patch (161 kB, dhruba borthakur)

          Activity

          Joydeep Sen Sarma added a comment -

          +1. good primitive to have.

          dhruba borthakur added a comment -

          The FileStatus object will indicate whether a path is a symbolic link. The NameNode will have a new INode type called INodeSymbolicLink. It will store the contents of the symbolic link in the INodeSymbolicLink object. This information will also be stored in the fsimage.

          The FileSystem.open() and FileSystem.create() calls will detect if the pathname is a symbolic link. If so, they will transparently open the path pointed to by the symbolic link.

          Symbolic links can be either relative or absolute. If the symbolic link is a complete URI, then nothing needs to be done. If it is a relative pathname, the dfs-client side code will invoke makeAbsolute to make it a full pathname.

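          (A minimal sketch of the relative-link resolution described above, assuming the makeAbsolute step mentioned in the comment resolves a relative target against the directory containing the link; the helper name is illustrative, not from any attached patch.)

              import org.apache.hadoop.fs.Path;

              /** Resolve a symlink target as described above: absolute targets
               *  (complete URIs) pass through untouched; relative targets are
               *  resolved against the directory that contains the link itself. */
              public static Path resolveTarget(Path link, Path target) {
                if (target.isAbsolute()) {
                  return target; // complete URI or absolute path: nothing needs to be done
                }
                return new Path(link.getParent(), target); // relative: anchor at the link's parent
              }
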
          Doug Cutting added a comment -

          A great use of symbolic links might be to enhance archives to link to their content.

          dhruba borthakur added a comment -

          After a discussion with Joydeep, we came up with the idea that FileSystem.open() will transparently resolve a symbolic link if the symbolic link is relative or if it is of the form hdfs://bar/foo.txt. This means that an HDFS symbolic link can point only to other hdfs pathnames.

          In the future, if the symbolic link is of the form file://dir/foo.txt, then the FileSystem API can return it to higher layers for interpretation. This will allow an HDFS symbolic link to refer to a pathname that is not an HDFS pathname. I do not plan to implement this feature in the short term.

          Doug Cutting added a comment -

          > This means that an HDFS symbolic link can point only to other hdfs pathnames.

          What's the point of this restriction? Why not permit symbolic links to arbitrary URIs?

          Raghu Angadi added a comment -

          I think it will result in a cleaner design/implementation to handle this at the FileSystem level (thus handling arbitrary URIs).

          dhruba borthakur added a comment -

          > This means that an HDFS symbolic link can point only to other hdfs pathnames.

          @Doug: I meant to say that "an HDFS symbolic link that does not have a full URI (absolute with schema, etc.) will point to a path in the same HDFS instance".

          @Raghu: Yes, it should be handled at the FileSystem level.

          dhruba borthakur added a comment -

          This patch demonstrates the proposed changes to the FileSystem API. I would like early comments on the changes to the public API before I implement the support for symlinks in HDFS.

          Doug Cutting added a comment -

          I think a better API might be:

          public FSDataInputStream open(Path f, int bufferSize) throws IOException {
            FileSystem fs = this;
            FileStatus stat = fs.getFileStatus(f);
            while (stat.isSymLink()) {     // the draft said 'for'; 'while' is what was meant
              f = stat.getSymLink();
              fs = f.getFileSystem(getConf());
              stat = fs.getFileStatus(f);
            }
            return fs.openData(stat, bufferSize);
          }

          public abstract FSDataInputStream openData(FileStatus stat, int bufferSize) throws IOException;


          We could, for back-compatibility, have openData() default to calling open() for one release, but I think most if not all FileSystem implementations are included in Hadoop, so we're mostly concerned about client back-compatibility here, not implementation back-compatibility.

          Also, FileStatus#getSymLink() should return a Path, not a String, no?

          Raghu Angadi added a comment -

          Both have an extra RPC (or system call in the case of LocalFS) for the open of a normal file (the common case). How about something like:

          public FSDataInputStream open(Path f, int bufferSize) throws IOException {
            FileSystem fs = this;
            for (int r = 0; r < maxRecursion; r++) {
              try {
                return fs.open(f, bufferSize, ..);
              } catch (SymLinkNotFollowedException e) {   // the draft was missing the closing brace of try
                f = e.getSymLink();
                fs = f.getFileSystem(getConf());
              }
            }
            throw new IOException("Too many levels of symbolic links: " + f); // the draft omitted a terminal statement
          }
          
          Doug Cutting added a comment -

          > Both have an extra RPC (or system call in the case of LocalFS) for the open of a normal file (the common case).

          Exceptions should not be used for normal program control. If the extra RPC is a problem, then a FileSystem can implement open() directly itself as an optimization. LocatedBlocks could be altered to optionally contain a symbolic link. LocalFileSystem can also override open(), since the native implementation already does the right thing. Would the code above do something reasonable on S3 & KFS, or would we need to override open() for those too? If we end up overriding it everywhere then it's probably not worth having a default implementation and having openData(), but we should rather just require that everyone implement open() to handle links.

          Raghu Angadi added a comment -

          > If the extra RPC is a problem, then a FileSystem can implement open() directly itself as an optimization.

          That imposes a restriction on what a SymLink can point to.

          > Exceptions should not be used for normal program control.

          Agreed. We could of course propose another FS.open() that returns something more than an FSInputStream.

          Doug Cutting added a comment -

          > That imposes a restriction on what a SymLink can point to.

          I don't follow what you mean here.

          > we could of course propose another FS.open() that returns something more than an FSInputStream.

          Yes, I drafted this and deleted it, since it's rather hairy and not clearly an advantage to anything but HDFS. Here it is:

          public class FileOpening {
            public FileOpening(boolean isLink, Path link) { ... }
            public boolean isLink() { ... }
            public Path getLink() { ... }
          }

          public FileOpening getFileOpening(Path f) throws IOException {
            FileStatus stat = getFileStatus(f);   // the draft said getFileOpening(f), an apparent typo
            return new FileOpening(stat.isLink(), stat.getLink());
          }

          public FSDataInputStream open(Path f, int bufferSize) throws IOException {
            FileSystem fs = this;
            FileOpening opening = fs.getFileOpening(f);
            while (opening.isLink()) {
              f = opening.getLink();
              fs = f.getFileSystem(getConf());
              opening = fs.getFileOpening(f);
            }
            return fs.openData(f, opening, bufferSize);
          }

          public abstract FSDataInputStream openData(Path f, FileOpening opening, int bufferSize)
            throws IOException;

          

          HDFS would override getFileOpening() to return a subclass that also contains block locations. Phew! Would anyone leverage this but HDFS?

          Raghu Angadi added a comment - edited

          > That imposes a restriction on what a SymLink can point to.
          Or maybe not, if an implementation (e.g. DistributedFileSystem) has all the information to instantiate another file system.

          One alternative is to have open() return a 'SymLinkFSInputStream extends FSInputStream' when a link is not followed by the implementation. A read on it will return the bytes representing the link (and it can have a getSymLink() method).

          Raghu Angadi added a comment -

          We can sort of consider FileSystem to be at a similar level as the VFS in Linux. In order to handle similar extensions in the future as well, I think this might be a good time for an implementation to take an object (wrapping arguments) and return an object that contains the relevant info. The main goal is to not require actual implementation API changes when something like this is added.

          Symlinks need to be handled for other file system operations as well: create(), getListing(), etc.

          Pete Wyckoff added a comment -

          Keep in mind that we will want to implement as close to a POSIX API as possible for this in libhdfs: https://issues.apache.org/jira/browse/HADOOP-4118. We may not expose all the functionality of general URIs, but for the case of hdfs only, I think the API should be as simple as possible.

          Sanjay Radia added a comment -

          There are the following 4 kinds of symbolic links (a classification sketch follows after this list):

          • dot relative - relative to the directory in which the symbolic link exists
            • these symbolic links can be processed in the NN without kicking them back to the client
          • volume root relative - relative to the root of the NN's namespace
            • these symbolic links can be processed in the NN without kicking them back to the client
          • relative to another file system
            • these are essentially a symbolic mount. Good use cases are remote NNs or Hadoop Archives
            • these symbolic links need to be kicked back to the client to be processed on the client side
          • relative to the root of the client's root (i.e. to where the client's / points)
            • the main use case is a symbolic link to a generic location (say /tmp) that is best picked from the client's environment
            • these symbolic links need to be kicked back to the client to be processed on the client side
            • I believe we can avoid this last one for the first implementation since I don't think the use case is strong.
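          (The classification sketch promised above. The class, enum names and the URI-based test are illustrative only; note that the fourth kind, client-root relative, cannot be distinguished from dot relative by syntax alone and would need an extra convention.)

              import java.net.URI;

              class LinkKinds {
                enum LinkKind { DOT_RELATIVE, VOLUME_ROOT_RELATIVE, OTHER_FILESYSTEM, CLIENT_ROOT_RELATIVE }

                static LinkKind classify(String target) {
                  URI u = URI.create(target);
                  if (u.getScheme() != null || u.getAuthority() != null) {
                    return LinkKind.OTHER_FILESYSTEM;     // e.g. hdfs://other-nn/a/b or file:///a/b
                  }
                  if (target.startsWith("/")) {
                    return LinkKind.VOLUME_ROOT_RELATIVE; // anchored at the root of this NN's namespace
                  }
                  // Scheme-less relative path: treated as dot relative; telling this apart
                  // from a client-root-relative link would require an extra marker.
                  return LinkKind.DOT_RELATIVE;
                }
              }
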
          Doug Cutting added a comment -

          Right, we should optimize dot-relative and volume-root-relative links so that they're resolved with a single RPC.

          In all cases, with HDFS links into the same filesystem, we should make just a single RPC to its namenode on open. We used to have an open() RPC, but that was replaced with getBlockLocations() when we noticed that open was no different than getBlockLocations(). With symlinks, we should probably add an open() call again.

          Open() could return a struct with 3 fields: isLink, linkTarget and blockLocations. Dot-relative and volume-root-relative links can be resolved on the NN, as follows:

          file type                  | isLink | linkTarget    | blockLocations
          ---------------------------+--------+---------------+---------------
          regular                    | false  | null          | non-null
          dot relative link          | true   | resolved path | non-null
          volume-root relative link  | true   | resolved path | non-null
          link to other filesys      | true   | foreign path  | null

          Does that make sense?

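          (A sketch of the three-field open() result described above; the class and field names are made up for illustration, and blockLocations is a stand-in for whatever LocatedBlocks-like type the real RPC would carry.)

              import org.apache.hadoop.fs.Path;

              /** Hypothetical return value of an open() RPC, one row of the table above. */
              class OpenResult {
                final boolean isLink;        // false for a regular file
                final Path linkTarget;       // resolved path, foreign path, or null
                final Object blockLocations; // stand-in for LocatedBlocks; null for foreign links

                OpenResult(boolean isLink, Path linkTarget, Object blockLocations) {
                  this.isLink = isLink;
                  this.linkTarget = linkTarget;
                  this.blockLocations = blockLocations;
                }
              }
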
          Sanjay Radia added a comment -

          Volume relative symbolic links (e.g. dot relative links) can be handled within the NN. However, whenever a symbolic link with a remote target is crossed, the client side needs to handle it - I am calling this a kickback.
          Note that a symbolic link can occur during an open, create, chmod or any other operation that supplies a path name. Further, it can occur anywhere in the path.
          Hence:
          open("/foo/bar/zoo")
          bar may be a symbolic link to a remote volume.
          A getFileStatus("/foo/bar/zoo") cannot return a status of "is symbolic link": zoo is not a symbolic link; bar is the symbolic link.

          Any operation that involves a path name, can issue a kickback which says: "I crossed a symbolic link whose target is another file system. I processed a part of the path and the remaining path is xxx; please resolve the remaining path xxx at the target of the symbolic link."
          In the above example the xxx (rest of path to process) is "zoo".

          One way to provide this kickback is to use an exception. (I consider an alternate below)
          The issue raised in the other comments in this Jira is whether or not an exception is suitable here - i.e. is this normal or abnormal operation of the NN?
          Following a remote symbolic link is not normal for a NN; it is not normal for a NN to recursively call another remote NN, hence the exception is a reasonable way to deal with this situation.
          The FileSystem interface should clearly NOT throw such an exception, because following symbolic links is a normal part of that interface; in this case I am not suggesting that the FileSystem throws an exception, merely the NN throws that exception.

          Many NameServices handle remote mounts as exceptions or kickbacks. For example, the Spring name service at Sun had optional trust relationships between name servers. If a symbolic link was to a name server with which it had a trust relationship, then it followed the symbolic link recursively (using DNS terminology); otherwise it did a kickback via an exception and the client side followed the links iteratively (again using DNS terminology).

          An alternate way (really Owen's suggestion) is to have open return an "FD" style object that contains the kickback information inside the FD object. The problem is that we will need a similar object for all operations that involve a pathname: open, stat, chmod, rename, etc. (Owen, please comment on this.)
          Hence I feel the use of the exception is a better way to do this.

          BTW I have a document on the above that I started on a few months ago at Yahoo but it is not completed. Dhruba suggested that I attach it above; will do so over the next week or so.

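          (One way to picture the kickback: an exception carrying the crossed link's target plus the unprocessed remainder of the path. This is an illustrative sketch of the description above, not code from an attached patch; the class name and fields are hypothetical.)

              import java.io.IOException;
              import org.apache.hadoop.fs.Path;

              /** Thrown by the NN when path resolution crosses a link into another file system. */
              class UnresolvedLinkKickback extends IOException {
                private final Path linkTarget;      // where the crossed symlink points (possibly foreign)
                private final String remainingPath; // suffix still to resolve, e.g. "zoo" in /foo/bar/zoo

                UnresolvedLinkKickback(Path linkTarget, String remainingPath) {
                  this.linkTarget = linkTarget;
                  this.remainingPath = remainingPath;
                }
                Path getLinkTarget() { return linkTarget; }
                String getRemainingPath() { return remainingPath; }
              }
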
          Doug Cutting added a comment -

          > a symbolic link can occur during an open, create, chmod [ ... ] anywhere in the path.

          Thanks for clarifying. I had not considered all of these cases.

          > I am not suggesting that the FileSystem throws an exception, merely the NN throws that exception.

          That's less of an issue, indeed.

          > The problem is that we will need a similar object for all operations that involve a pathname: open, stat, chmod, rename, etc.

          Could we use a common base class for the return value of all of these RPC methods? The base class could have a "symLink" flag and fields that the client could check. Is that better than exceptions?

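          (To make the base-class alternative concrete, a hedged sketch with hypothetical names: every pathname RPC returns a subclass of one result type, and the client checks a flag instead of catching an exception.)

              import org.apache.hadoop.fs.Path;

              /** Hypothetical common base class for NN RPC results: either the operation
               *  completed, or resolution stopped at a cross-filesystem link. */
              abstract class RpcResult {
                Path linkTarget;      // non-null if resolution stopped at a link
                String remainingPath; // suffix still to resolve at the target

                boolean crossedLink() { return linkTarget != null; }
              }

              class ChmodResult extends RpcResult { /* no extra fields */ }

              class OpenRpcResult extends RpcResult {
                Object blockLocations; // stand-in for LocatedBlocks
              }
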
          Robert Chansler added a comment -

          Is there a presumption that if (as in Sanjay's example) /foo/bar is a link to "X", then /foo/bar/zoo is exactly "X/zoo"? This might be problematic if X is a reference to another system, introducing a requirement for interfaces everywhere to have "root" and "remaining" parameters. Or maybe as a practical matter this is unimportant. In any case the rules need to be clear. Is X="/a/b" different from X="/a/b/" when resolving /foo/bar/zoo? Are there cases where "/foo/bar/zoo" is not permitted as a path on the local system, requiring that path syntax checking be done late just in case /foo/bar is a link?

          Konstantin Shvachko added a comment -

          sanjay> merely the NN throws that exception.

          Maybe a symlink exception from the NN is not bad, but wrapping the symlink data inside the exception seems to be a bad way of returning data back to the client. I mean, this part of the code proposed above does not look right:

          catch (SymLinkNotFollowedException e) {
            f = e.getSymLink();
          }
          

          I would definitely prefer to have a common base class returned, as Doug proposes.

          Konstantin Shvachko added a comment -

          Another observation is that usually symbolic links are local (volume relative), but they can point to mounted namespaces. So maybe we should support 2 extra inode types:

          • symlinks: volume relative only
          • mount points

          That of course does not eliminate the problems discussed, but it introduces a clear distinction between volume-local and remote file system symlinks.
          Raghu Angadi added a comment -

          > catch (SymLinkNotFollowedException e) { [...]

          Just to clarify: the above was only mentioned as a way for DFSClient to inform FileSystem, to overcome the limitation of the return type of open(), not as an exception thrown by the NameNode to the client.

          Of course, the exception is not necessary if open and friends are modified to return a class/struct etc.

          dhruba borthakur added a comment -

          I would like to propose that symbolic links point only to files (and not to directories). Why this restriction:

          1. Symbolic links that point to directories are more like mount-points.
          2. My current use-case (HADOOP-4058) does not require symbolic links to point to directories.
          3. Directories in HDFS are more like object-references to file(s); they do not yet have all the POSIX semantics associated with them. Clients do not parse components of the directory path; instead they send the entire pathname to the server for a lookup.
          4. Avoiding symbolic links that point to directories keeps the code simple (till we really need it). An open(/foo/bar/zoo) where bar is a symbolic link does not arise.

          In this proposal, a symbolic link can still be relative or absolute. No kickback (via exception or otherwise) is needed to implement this proposal. Also, this proposal does not preclude implementing "symbolic links to directories" in the future.

          Raghu Angadi added a comment -

          > In this proposal, a symbolic link can still be relative or absolute. No kickback (via exception or otherwise) is needed to implement this proposal.
          Does it mean an extra RPC, as in the patch? If not, how does open() look?

          Doug Cutting added a comment -

          > symbolic links point only to files (and not to directories).

          Isn't that hard to control? How would we stop someone from replacing the target of a symlink with a directory?

          Allen Wittenauer added a comment -

          Symbolic links to directories allow admins to move content around without disrupting user processes. Without symbolic links to directories, we'd end up creating a symbolic link for each file.

          That seems extremely counter-productive and more likely to cause the name node a significant amount of grief.

          dhruba borthakur added a comment -

          @Raghu: The FSDataInputStream will implement another interface, StreamType, that will store the information on whether this is a symbolic link. No extra RPC is needed.

          @Doug: By definition, the filesystem has no control over the target of a symbolic link. The user could make it point to a non-existent path too. There will be no enforcement of whether the target is a directory or not.

          @Allen: This proposal should not introduce additional grief for the namenode. If you replace a file with a symbolic link, it still occupies one inode. No additional inodes are necessary. I agree that admins might like symbolic links to directories, but maybe we can do that at a later date when the use-case becomes more concrete. Does that sound acceptable?

          The simplicity for the filesystem is that no "kickbacks" are needed while traversing paths.

            /**
             * Opens an FSDataInputStream at the indicated Path.
             * This does not follow symbolic links.
             * @param f the file name to open
             * @param bufferSize the size of the buffer to be used.
             */
            public abstract FSDataInputStream openfs(Path f, int bufferSize)
              throws IOException;
          
            /**
             * Opens an FSDataInputStream at the indicated Path.
             * If the specified path is a symbolic link, then open the
             * target of the symbolic link.
             * @param f the file to open
             */
            public FSDataInputStream open(Path f) throws IOException {
              return open(f, getConf().getInt("io.file.buffer.size", 4096));
            }
          
            /**
             * Opens an FSDataInputStream at the indicated Path.
             * @param f the file to open
             * @param bufferSize the size of the buffer to be used.
             */
            public FSDataInputStream open(Path f, int bufferSize)
              throws IOException {
              FileSystem fs = this;
              FSDataInputStream in = fs.openfs(f, bufferSize);
              while (in.isSymlink()) {
                // construct new path pointed to by symlink
                Path newpath = new Path(in.getSymlink());
                if (!newpath.isAbsolute()) {
                  newpath = new Path(f.getParent(), newpath);
                }
                in.close();
                f = newpath;
                fs = f.getFileSystem(getConf());
                LOG.warn("XXX Opening symlink at " + f);
                in = fs.openfs(f, bufferSize);
              }
              return in;
            }
          
          
          public class FSDataInputStream extends DataInputStream
              implements Seekable, PositionedReadable, StreamType {
          ....
          }
          
          
          /** Types of streams */
          interface StreamType {
            /**
             * Is this stream a symbolic link?
             */
            public boolean isSymlink() throws IOException;
          
            /**
             * Return the contents of the symlink
             */
            public String getSymlink() throws IOException;
          }
          
          
          dhruba borthakur added a comment -

          This patch implements support for symbolic links to files (not directories). It is attached to discuss/demonstrate the code-simplicity of this proposal.

          Doug Cutting added a comment -

          > There will be no enforcement of whether the target is a directory or not.

          To be clear, under your proposal, if one links hdfs:///foo/bar to s3:///bar, and s3:///bar/baz exists, then trying to list hdfs:///foo/bar/baz will yield a FileNotFound exception. Is that right? If so, it seems like a major loss in functionality.

          Raghu Angadi added a comment - edited

          > @Raghu: The FSDataInputStream will implement another interface, StreamType, that will store the information on whether this is a symbolic link. No extra RPC is needed.

          Ok, this is one of the options mentioned earlier that works just for open(). As the discussion in this Jira shows, symlinks are a fairly fundamental change to the filesystem and affect various parts. Just like open(), once the various options for general handling of links are discussed here, I don't think the patch will be much more complicated than what you have attached.

          Symlinks have been used for many decades and I don't think there is any lack of use cases. I don't think HADOOP-4058 alone should be the criteria for this Jira.

          dhruba borthakur added a comment -

          @Doug: Your example scenario is correct. It could be a major loss in functionality, but maybe we can put in the additional complexity and reap the additional benefits at a later date. I am assuming that the need for the additional functionality can wait for some time.

          @Raghu: I agree that symlinks have many general purpose use-cases. But there are also file systems (very similar to Hadoop) that do not support symlinks/hardlinks at all.

          For symlinks supported by various OSes: the OS calls the file system with a lookup call for every path component, so the filesystem resolves only one piece of the path at a time. In HDFS, the FileSystem is passed the entire path. Thus, the technique to resolve partial paths becomes messy and has no precedent here. That's why I am proposing that we delay this part of the functionality.

          Comments/feedback welcome.

          Raghu Angadi added a comment -

          > I agree that symlinks have many general purpose use-cases. But there are also file systems (very similar to Hadoop) that do not support symlinks/hardlinks at all.

          Yes. They probably don't have the symlinks proposed in this jira either. I mainly wanted to say that once the interface is there, the implementation of real symlinks is not much more complicated than the attached patch. I could be wrong...

          Joydeep Sen Sarma added a comment -

          Wrt Allen's comment - I think it's not without merit. He's probably thinking about moving things within the same hdfs instance, in which case having a symlink per leaf is going to lead to a doubling of inodes (for the subtree being moved).

          A couple of orthogonal ideas:
          1. Would it not be easy to write a new file system client that does component-by-component lookup (instead of the DFSClient behavior)? This would not seem to require any namenode changes.
          2. Support symlinks only within the same hdfs instance. The namenode can resolve things internally - no changes to the client.

          Then we can combine these two approaches. The namenode can return a specific error if it finds that the path being resolved references an external fs (using the work done in #2). The regular DFSClient can fall back to the special client file system that does component-by-component lookup (work done in #1) when it sees this error. This prevents incurring the overhead of component-by-component lookup unless one hits a symlink that crosses an fs boundary. (A sketch of this fallback follows below.)

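          (A sketch of the combined approach just described, reusing the hypothetical UnresolvedLinkKickback from the earlier sketch; fastPathResolve and slowComponentResolve are made-up names standing in for the normal whole-path RPC (#2) and the component-by-component client (#1).)

              FSDataInputStream open(Path f, int bufferSize) throws IOException {
                try {
                  // Common case: a single RPC, the whole path resolved inside the NN (#2).
                  return fastPathResolve(f, bufferSize);
                } catch (UnresolvedLinkKickback e) {
                  // Rare case: resolution crossed an fs boundary; fall back to the
                  // component-by-component client (#1), starting at the link target.
                  Path rest = new Path(e.getLinkTarget(), e.getRemainingPath());
                  return slowComponentResolve(rest, bufferSize);
                }
              }
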
          Doug Cutting added a comment -

          > Would it not be easy to write a new file system client that does component-by-component lookup?

          Sure, that's possible, but it would result in many more RPCs per file operation. We don't want to multiply the load on the namenode in this way. So this should involve only one RPC per foreign link traversed, rather than one per component.

          Joydeep Sen Sarma added a comment -

          Doug - that's what I was proposing: doing it in a way where the component-by-component lookup is only done when symlinks are encountered.

          Raghu Angadi added a comment -

          > doing it in a way where the component-by-component lookup is only done when symlinks are encountered

          +1. This was implicit in most of the proposals, I think.

          Doug Cutting added a comment -

          > component-by-component lookup is only done when symlinks are encountered

          It depends on what you mean by "component-by-component". If, in hdfs://nn/foo/bar/baz/boo, baz is a symbolic link to a different filesystem, then we should only make a single RPC to nn to resolve this link, not one for /foo, one for /foo/bar, etc.

          dhruba borthakur added a comment -

          So folks,

          Proposal 1: Do you think that it is a good idea to put in a subset of the full symlink functionality (without support for symlinks to directories) as a first cut? Then, if a use case arises, we can do the full symlink functionality at a later date.

          Proposal 2: Proposal 1 + if the symlink is for a pathname in the same namenode, then the namenode will resolve the symlink transparently. This will allow symbolic links to directories that point to paths inside the same namenode.

          Please respond with +1/-1/0 for each of the two proposals. The reason I am not implementing the full set of functionality right now is that the 0.19 cutoff date is only a few days away!

          Raghu Angadi added a comment -

-1 on the reasons given for not doing full symlinks: it will miss 0.19 and/or it is too complex.

-0 for Proposal 1
-1 for Proposal 2

          Doug Cutting added a comment -

What use cases are there for filesystem-internal symlinks? The really interesting use cases are cross-filesystem, since that could make archives transparent and also permit a volume-mounting style, letting a single namespace span multiple clusters and potentially making the namenode less of a bottleneck.

          Without some strong use cases I'm -1 for any implementation that does not support cross-filesystem links.

          Bryan Duxbury added a comment -

We would use filesystem-internal symlinks as a sort of fast copy in situations where parts of our dataset are actually processed and others are merely copied. Technically, a hard link or copy-on-write scheme (à la GFS) would be better suited, but we could make do with symlinks.

          Allen Wittenauer added a comment -

          -1 to both, because I think we're rushing this without a significant amount of thought as to what a multi-filesystem directory structure should look like.

In my mind, cross-filesystem "symlinks" don't really seem like "real" symlinks, but rather a sort of hacked version of autofs support. How do you determine if a file is "real" or not? How do I "undo" a symlink? What does this look like from an audit log perspective? Is this really just a "short cut" to providing the equivalent of mount?

If we ignore what I said above, I'm trying to think of a use case where I wouldn't want to symlink an entire dir. A single file? Really?

          I think it is much more interesting to provide a symlink to a directory. I can have "data/v1.423" and then have a symlink called LATEST that points to the latest version of a given versioned data set.

          Konstantin Shvachko added a comment -

          I am -1 on partial implementations, because your case of file-only symlinks can be handled on the application level.

In fact, path resolution as Joydeep proposed can be handled outside the file system without checking each path component.
We just need to throw FileNotFoundException with the exact path prefix that has not been found.
E.g. you want to open a file path = /dir0/link/dir1/file where link is a file containing a symlink /dir3/dir4/.
If you call open(path), a FileNotFoundException will be thrown with the message "File is not found: /dir0/link/dir1/".
Now the application reads the contents of file /dir0/link, creates a new path = /dir3/dir4/dir1/file, and calls open(path) again.
An optimization would be for the name-node to throw a new DirectoryNotFoundException("Directory is not found: /dir0/link");
then the client knows for sure that "link" does exist, but is a file rather than a directory, and therefore may be a symlink. (A sketch of this retry loop follows the list below.)
It seems to me this:

• suits your case,
• requires minimal changes to the fs internals, and
• does not affect existing hdfs behavior when links are not used.
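
A minimal sketch of this client-side retry, assuming the link file's contents are just the target path as text; readLinkTarget and openFollowingLink are illustrative helpers (a real client would parse the unresolved prefix out of the exception message rather than hard-code it):

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ClientSideLinkResolution {
      // Reads the target path stored in a link file; assumes the link's
      // contents are simply the target path as UTF-8 text.
      static String readLinkTarget(FileSystem fs, Path linkFile) throws IOException {
        byte[] buf = new byte[(int) fs.getFileStatus(linkFile).getLen()];
        FSDataInputStream in = fs.open(linkFile);
        try {
          in.readFully(0, buf);
        } finally {
          in.close();
        }
        return new String(buf, "UTF-8");
      }

      // One round of the retry: open, and on FileNotFoundException rewrite
      // the path through the link and try again.
      static FSDataInputStream openFollowingLink(FileSystem fs) throws IOException {
        try {
          return fs.open(new Path("/dir0/link/dir1/file"));
        } catch (FileNotFoundException e) {
          // Prefix "/dir0/link" would come from the exception message.
          String target = readLinkTarget(fs, new Path("/dir0/link")); // "/dir3/dir4/"
          return fs.open(new Path(target, "dir1/file"));              // /dir3/dir4/dir1/file
        }
      }
    }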
          Raghu Angadi added a comment -

Also, this is a useful and important enough feature to deserve an exception for a 0.19.x release if 0.20.0 is too late or has too much other stuff.

> In my mind, cross-filesystem "symlinks" don't really seem like "real" symlinks, but rather a sort of hacked version of autofs support.

If we think of Hadoop FileSystem as a VFS, and HDFS, S3, LocalFS, etc. as various filesystems, then cross-FS symlinks look more like real symlinks (like a link in ext3 pointing to a directory in NTFS). Yes, they are still not entirely real, since a symlink in LocalFS cannot point to "hdfs://users" (actually it is possible to handle those to a certain extent).

          Sanjay Radia added a comment -

          Symlink vs mount distinction.
          -------------------------------------------
The reason the two concepts merge in Hadoop (while they are distinct in other systems) is that Hadoop has file URIs that can point to arbitrary files/dirs in arbitrary volumes. This allows a mount to be specified merely by a URI of the target (as opposed to a server name, protocol, etc.).

          I think Allen has a valid point that remote symlinks to dirs are more useful than symlinks to files, especially for moving load (storage load or operational load) off to other name servers.

@Joydeep's component-by-component lookup file system.
He proposes a two-step approach – if an error occurs in following a symlink, then do a component-by-component lookup. If one is willing to return an error on a symlink, it is equally easy to send more information with the error; the error information can denote the remaining path, and one can then implement symlinks efficiently.

          Implementing a subset
          ------------------------------
We should do a design that works for symlinks to dirs or files; then implementing a subset for files is fine.

          The use case for symlinks to directories is strong:

• HADOOP-4058 is made stronger using symlinks to dirs, as Allen points out. Offloading entire dirs to a remote volume is as useful, if not more useful, than offloading individual files.
          • Symlinks to archive file systems (har://...) is useful and helps make the archive transparent.
          • Filesystems that have symlinks to files also have symlinks to dirs. (My own use of symlinks in Unix is mostly to dirs rather than files.)
• Most file systems have a mount operation, and its use case is well established.
  Mounts, as pointed out above, are best achieved in Hadoop as symlinks to dirs due to Hadoop's URI names. A separate API for mounts is not needed.

I am a little puzzled about why folks think kickbacks are hard. It will take a few discussions to get the internal interfaces right, but it is not that complicated. Hence it is hard to do this for 0.19, but we can do it for 0.20.
Dhruba, it looks like Facebook wants this subset feature in 0.19 - otherwise it is best to wait for 0.20 and do it right.

My vote is that if we can determine that the API for symlinks to files is a strict subset, then we can go ahead and do the subset in 0.19. Otherwise wait till 0.20 for the full solution.
(I think we can do the dot-relative symlinks later since Facebook's use case does not need them in 0.19.)

          We also need to determine the semantics of other file system operations besides open.
A good way to proceed is to create a table with a column for the full solution and another for the symlinks-to-files subset.
Q: If the symlink target happens to be a directory (or is later changed to a dir), do we throw an exception?

          dhruba borthakur added a comment -

+1 to Sanjay's proposal. I am all for symlinks to any arbitrary path (dirs or files). My initial proposal was to gauge the feeling of the community on whether it is better to get symlinks to files into 0.19 and then continue working to support symlinks to any arbitrary path in 0.20. The code is not difficult, but there will be some changes to existing interfaces in DistributedFileSystem.java.

          "kickbaks" are hard, not from a design point of view but for coding point of view....especially because we do not want to throw UnresolvedPathException all the way back to the client. This means, the NameNode has to send back the "kickback" information as part of the result code for almost every RPC. This, in turn, means that almost all APIs in ClientProtocol will have to be modified.

          Robert Chansler added a comment -

          -1
Sometimes just doing the easy part is not good enough. In this case, the reasonable expectation of users is that if HDFS has links, then 1) you can link to anything, and 2) links can be path components. (Those two issues aren't well separated in the previous discussion.) It's not reasonable to have to keep explaining to each user that links only sort of work. And even the easy case is not completely thought out: Is it an error to make a link to a directory, or is that something that just won't work when using the link? What about modifications to the shell commands to support links?

          dhruba borthakur added a comment -

          First version of the patch that implements the entire symlink solution. Please review.

          Allen Wittenauer added a comment -

          1) Am I missing it, or is there no equivalent of chown -h / chmod -H?

          2) What does the audit log look like when someone does operations on a symlink? Does it show the resolved path or does it show the symlink when it is read? What does symlink creation look like?

dhruba borthakur added a comment - edited

1. Can you please point me to what chmod -H should do?

2. Creation of a symlink shows up this way in the audit log:

          ugi=dhruba,users,engineers,svnuser ip=/1.2.3.4 cmd=createSymlink src=/user/dhruba/foo dst=../test1 perm=dhruba:none:rw-rw-rw-

Allen Wittenauer added a comment - edited

          chmod -H (under BSD, at least):

          -H If the -R option is specified, symbolic links on the command line
          are followed. (Symbolic links encountered in the tree traversal
          are not followed by default.)

          This likely only makes sense if the symlink is pointing to the local file system or another HDFS, though.

          BSD also supports -L which might be useful as well:

          -L If the -R option is specified, all symbolic links are followed.

          dhruba borthakur added a comment -

          Merged patch with latest trunk.

          Doug Cutting added a comment -

          Some comments:

          • I don't like the 'vfs' package and naming. Symbolic links should not be a distinguished portion of the FileSystem API, but seamlessly integrated. So I suggest that vfs/VfsStatus be renamed to FSLinkable, vfs/VfsStatusBase to FSLink, vfs/VfsStatusBoolean to FSLinkBoolean, vfs/VfsStatusFileStatus to LinkableFileStatus, etc. If these are to be in a separate package, it might be called 'fs/spi', since they are primarily needed only by implementors of the FileSystem API, not by users. The protected implementation methods should be called openImpl(), appendImpl(), etc.
          • getLink() should return a Path, not a String.
          • getLink() should throw an exception when isLink() is false.
• The check for link cycles is wrong. If the loop starts after the first link traversed, it will not be detected. A common approach is simply to limit the number of links traversed to a constant. Alternatively, you can keep 'fast' and 'slow' pointers, incrementing the fast pointer through the chain twice as fast as the slow one. If they are ever equal then there's a loop. This will detect all loops.
          • I don't see the need for both getLink() and getRemainingPath(). Wouldn't it be simpler to always have getLink() return a fully-qualified path? Internally a FileSystem might support relative paths, but why do we need to expose these?

Instead of repeating the link-resolving loop in every method, we might use a "closure", e.g.:

public FSInputStream open(Path p, final int bufferSize) throws IOException {
  return resolve(p, new FSLinkResolver<FSInputStream>() {
    FSInputStream next(Path path) throws IOException { return openImpl(path, bufferSize); }
  });
}

          where FSLinkResolver#resolve implements the loop-detection algorithm, calling #next to traverse the list.
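
A sketch of how such a resolve() might implement the constant-limit loop detection described above; the FSLinkable bound, the MAX_LINK_DEPTH value, and the method names are assumptions for illustration, not code from the patch (the fast/slow-pointer alternative would instead track two positions in the chain):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;

    abstract class FSLinkResolver<T> {
      abstract T next(Path p) throws IOException;
    }

    interface FSLinkable {
      boolean isLink();
      Path getLink();  // fully-qualified target, per the comments above
    }

    class LinkResolution {
      private static final int MAX_LINK_DEPTH = 32;  // assumed constant

      static <T extends FSLinkable> T resolve(Path p, FSLinkResolver<T> resolver)
          throws IOException {
        for (int depth = 0; depth < MAX_LINK_DEPTH; depth++) {
          T result = resolver.next(p);  // one operation per link traversed
          if (!result.isLink()) {
            return result;              // a normal value: resolution is complete
          }
          p = result.getLink();         // follow the chain and try again
        }
        throw new IOException("Too many link traversals, possible cycle: " + p);
      }
    }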

          dhruba borthakur added a comment -

          Hi Doug, thanks for your comments. I made all the changes you suggested except one. Please let me know if these changes look good. I am working on another unit test for this feature.

          You had suggested that fs/spi/VfsStatusBase be renamed to FSLink. However, my thinking is that the VfsStatusBase can be used not only for links but for other items too. For example, if we want to return other types of NameNode exceptions to the client, we could use VfsStatusBase. That's the reason why I am hesitant to rename it to FSLink.

          Doug Cutting added a comment -

          > However, my thinking is that the VfsStatusBase can be used not only for links but for other items too.

          But aren't all those other cases where the return value can be a link? I'll try to review the patch ASAP.

          Doug Cutting added a comment -

I still much prefer the name 'link' to 'vfs'. You use the name 'vfs' in support of symbolic links only, so these classes should be named accordingly. There's no need to invent another name here. No one knows what a 'vfs' is, and we don't need to define it or confuse folks with this notion. Everyone already knows what a link is.

          Can we get away with only a single implementation of the link resolver? You already create a bogus FileStatus in the status-based link resolver, so can't we use the same approach to implement the path-based link resolver in terms of the status-based one, so that we only have a single implementation of the loop-detection logic, not two?

          There are still a few implementations of getSymlink() that don't throw exceptions for non-links.

          You still have not explained the necessity of getRemainingPath().

          Raghu Angadi added a comment -

          +1 for having "impl" methods for FileSystem implementations.

Currently the patch does both of these:

• Implementations throw UnresolvedPathException
• Many return types are changed to return this path information.

Are both of these required, or just one? I think throwing the exception is enough and might be cleaner w.r.t. the interface. The API can return what makes sense for it rather than worrying about a link-related class.

          Doug Cutting added a comment -

          > Many return types are changed to return this path information.

          Note that only the return types in the impl methods are changed. The user methods retain their original signatures.

          > I think throwing the exception is enough [ ... ]

          But a link is not an exceptional situation, it is a normal situation and should thus be handled with normal control flow. It does not require the non-local control flow that exceptions provide. The callee needs to pass specific information directly back to the caller. Using a common base class is appropriate here.

UnresolvedPathException is only used within HDFS, not in the FileSystem API. My review has focused on the changes to the public FileSystem API, not to the HDFS implementation. Someone more familiar with HDFS should review the HDFS portions of this patch. On a cursory examination, it does appear that this patch uses exceptions for control flow within HDFS, which looks like a bad idea to me. Rather, HDFS's internal APIs should be modified to explicitly handle links as return values.

          dhruba borthakur added a comment -

          @Doug: Ok, I will post a new patch very soon.

          getRemainingPath() returns the trailing path components that could not be resolved by a file system implementation.

          Suppose one is trying to resolve hdfs://home/dhruba/link/dir/file.txt and link is a symbolic link pointing to file:///etc, then getSymlink() will return file:///etc and getRemainingPath() will return dir/file.txt. Now, FileSystem.java will use these two components to create a new path. Are you saying that getSymlink() can return file:///etc/dir/file.txt?
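
For illustration, composing those two components with Hadoop's standard Path(parent, child) constructor would look like this (values taken from the example above):

    Path link = new Path("file:///etc");        // what getSymlink() returns
    String remaining = "dir/file.txt";          // what getRemainingPath() returns
    Path resolved = new Path(link, remaining);  // file:///etc/dir/file.txt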

@Raghu: UnresolvedPathException is a namenode-private class. It is not thrown to the dfs client. The NameNode catches this exception, maps it to an FSLink object, and returns that to the client.

          dhruba borthakur added a comment -

          I vote for using exceptions internal to the NameNode to keep internal interfaces clean and elegant.

          Doug Cutting added a comment -

          > Are you saying that getSymlink() can return file:///etc/dir/file.txt?

          Yes.

          dhruba borthakur added a comment -

          Incorporated Doug's review comments. More unit tests coming soon.

          Doug Cutting added a comment -

          This is looking good. A few nits:

          • let's call it just 'link' not 'symlink', okay? If we ever implement generic hard links, we'd make those the qualified case.
          • several implementations of getSymlink() still do not throw exceptions when isLink() is false.
• FSDataInputStream and FSDataOutputStream should not directly implement FSLinkable. Rather, openImpl, createImpl and appendImpl should return FSDataInputStreamLink and FSDataOutputStreamLink, classes which wrap a stream and a link; the public method can then dereference this to return FSDataInputStream and FSDataOutputStream as before (see the sketch after this list).
          • shouldn't all of the implementations of Impl methods be protected, not public?
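
A sketch of what such a wrapper might look like; the field and accessor names are assumptions based on the naming in this comment, not code from the patch:

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;

    // Wraps either an opened stream or the link that interrupted the open;
    // exactly one of the two fields is non-null.
    class FSDataInputStreamLink {
      private final FSDataInputStream stream;
      private final Path link;

      FSDataInputStreamLink(FSDataInputStream stream, Path link) {
        this.stream = stream;
        this.link = link;
      }

      boolean isLink() { return link != null; }
      Path getLink() { return link; }
      FSDataInputStream getStream() { return stream; }
    }

The public open() would then run the resolve loop over openImpl() and finish with getStream(), so its signature still returns FSDataInputStream as before.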

          Note: I am reviewing only the src/core sections of this patch, not the src/hdfs.

          dhruba borthakur added a comment -

          Incorporated Doug's latest review comments.

          can somebody from dfsland please review this too? Thanks.

          Raghu Angadi added a comment -

          > I vote for using exceptions internal to the NameNode to keep internal interfaces clean and elegant.

Why should the NameNode not be clean and elegant? I actually don't think returning some base class unrelated to the member/API is clean; it is essentially the same as returning "Object" and letting the upper layers check the type.

          > can somebody from dfsland please review this too? Thanks.
          Because of my above prejudice, I am probably not the right person to review.

          dhruba borthakur added a comment -

Hi Raghu, I was saying that "using exceptions inside the namenode keeps NameNode internal interfaces clean and elegant". So, please go ahead and review this patch if you have some time. Thanks.

          Doug Cutting added a comment -

          > I actually don't think returning some base class unrelated to member/api is clean. [ ... ]

          Except it is related. The contract of all of these methods has changed, so that they can either return, e.g., an open file or a link. We need a uniform way to represent this. A two-step API is unacceptable for performance reasons. A base return class thus seems like a fine way to do this. What would you suggest instead?

Raghu Angadi added a comment - edited

I deal with and care about both FileSystem and HDFS every day. We have two very different approaches to the same problem in these two places (changing return types vs. throwing an exception). It does not feel quite right for me to +1 when I prefer only one of the approaches. This does not affect the external API, but for developers internal APIs are just as important.

A two-step API is of course not good (as I pointed out in the early proposals), but I don't think that is related to the 'return type' vs. exception issue.

My preference is to have the same approach as in HDFS. Any API that can throw IOException can throw UnresolvedLinkException. Upper layers that can handle it do so. No change in return types is required, and we don't need (rather strange) classes like FSLinkBoolean, FSInputStreamLink, etc. As a minor side benefit, we also avoid all the extra objects created (even when no links are involved).
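
A sketch of that exception-based shape, for contrast with the return-type approach above; UnresolvedLinkException is the name Raghu uses, but its getResolvedPath() accessor and the openImpl() method are assumptions for illustration:

    // The public method keeps its original return type; the loop catches the
    // link "kickback" and retries with the rewritten path.
    public FSDataInputStream open(Path p, int bufferSize) throws IOException {
      for (int depth = 0; depth < 32; depth++) {  // same constant cycle limit
        try {
          return openImpl(p, bufferSize);
        } catch (UnresolvedLinkException e) {
          p = e.getResolvedPath();                // follow the link and retry
        }
      }
      throw new IOException("Too many link traversals: " + p);
    }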

Finally, I don't think exceptions are meant only for hard errors (FileNotFound, e.g.). I doubt an exception is always an error, but my experience with Java is less than 2 years.

          edit: minor. my first edit since Sept 16th..

          Doug Cutting added a comment -

          Exceptions are intended for cases where you're not sure who will catch them. They're a non-local control-flow mechanism. Using them for normal control flow is an abuse and leads to spaghetti logic. In this case we are changing the FileSystem service-provider interface so that many methods can return either a link or a normal value. The appropriate implementation is a union result type, not exceptions. In Java we implement this union by extending a 'Linkable' base class.

          http://192.220.96.201/essays/java-style/exceptions.html
          http://www.javapractices.com/topic/TopicAction.do?Id=19
          http://leepoint.net/notes-java/flow/exceptions/03exceptions.html

          Sanjay Radia added a comment -

          Some minor comments on FileSystem side of the patch.

• I prefer the name symlink instead of link. Unix terminology uses link for a hard link and symlink for a symbolic link. I think we should stick to that regardless of whether or not we plan to add hard links in the future.
• Dhruba, you left some of the implementation methods public; they should be protected.
• minor – the method next() suggests that one follows the path to the next symbolic link. It does not, strictly. I suggest calling it doOperation() (this is a style issue and so merely a soft suggestion).
          Doug Cutting added a comment -

          > I prefer the name symlink instead of link.

          Hmm. 'Symbolic link' is the marked case in unix, the exception, where 'link' alone refers to a hard link. The reason is historic, symbolic links were not added to unix until 4.2 BSD. The term 'link' was already used to describe hard links and so had to be qualified, even though symbolic links are now more commonly used than hard links. In Hadoop I argue we should make hard links the marked case, and that the default understanding of 'link' in Hadoop, as on the web, should be like unix's symbolic link. If we ever add support for hard links, then they would become the marked case in Hadoop, requiring qualification, like server-side redirects. However I would not veto this if the majority prefer a longer name here.

          > the method next() suggests that one follows the path to the next symbolic link. It does not strictly.

          Can you elaborate? Each time this is called a link is resolved, returning a path that may contain more links, following a chain of links, no? Also, 'doOperation()' is next to meaningless. So if 'next' is indeed inappropriate we should replace it with something more specific, not something generic.

          Tsz Wo Nicholas Sze added a comment -

I think we don't need Path.createNewPath(...) as a public API. This method may be confusing.

          Sanjay Radia added a comment -

If throwing an UnresolvedPathException is considered clean style inside the namenode, then it should be clean for the RPC interface.
We need to be consistent here. It is either clean style or it is not.

In my opinion, throwing the UnresolvedPathException across the RPC interface is clean – it is one of many internal interfaces.

Joshua Bloch in "Effective Java" (Item 40) argues that exceptions can be used for recoverable conditions. Processing a symbolic link is
a recoverable condition. Later in the same item he argues that such exceptions can provide information to allow the recovery (in this case the remaining
pathname to resolve).

Indeed, the link that Doug gave argues in favour of using declared exceptions for recoverable conditions
[http://192.220.96.201/essays/java-style/exceptions.html]:
          "My rule of thumb for whether I make an exception declared or undeclared is whether localized recovery from the exception being thrown is sensible. If the calling method (or one of its recent callers) of the code is the right place to handle a given failure type, I represent that failure with a declared exception. If, on the other hand, if the failure case is best handled by a global handler which catches all the exceptions for a given component of a program and treats them all as failures of that subsystem, I use undeclared exceptions".
          In the case of symbolic links, localized recovery by following the symbolic link is possible, and hence using an exception is acceptable.
          Here the author is distinguishing between declared and undeclared – but the point is that declared exceptions are okay for recoverable conditions.

If Java did not have the distinction between declared and undeclared exceptions, and exceptions were only for errors, then Doug's argument would be valid.

          Dhruba made the right choice in throwing the exception inside the name node - otherwise all the interfaces and code would have become unclean.
          The same style should have continued for the RPC operation.

          The RPC interfaces in the patch have been changed as follows.

void create(...) becomes FSLink create(...)

and boolean setReplication(...) becomes FSLinkBoolean setReplication(...)
...and similarly for many other methods.

The original interfaces were intuitive – the input parameters and the return values are obvious for each method.
The newer ones have this FSLinkXXX return value that, IMHO, is not clean.
They will get worse with the union overloading over time if we add new reasons to recover on the client side.
The use of exceptions would have kept the interface clean (just as was done inside the name node code).

          Doug Cutting added a comment -

          > Here the author is distinguishing between declared and undeclared - but the point is that declared exceptions are okay for recoverable conditions.

          Unfortunately that is not the question here. He, among many others, says not to use exceptions for normal control flow. Links in a filesystem are normal. They are not conditions out of the control of the code in question.

          > otherwise all the interfaces and code would have become unclean.

          What does 'unclean' mean? The interfaces should reflect their function. We're making a fundamental change to these interfaces. The return types of all of these operations must change. We could implement a return type change with exceptions or thread locals. Either would require fewer changes to the code, but both would be terrible programming style. We need to suck it up and change the return types consistently to reflect the change in semantics. Let's not call something 'unclean' just because it is more work.

          Konstantin Shvachko added a comment -

The code looks bizarre (or unclean) with all the FSLink*** return types.
I do not like it when exceptions are used for normal control flow.
But looking at what it takes to avoid that, I lean toward the idea that maybe this
is the exceptional case where we should diverge from the general principles.

          I see two compromise solutions.

1. Introduce a new method Path getSymLink(Path src), which returns the first symbolic link in the src path.
  Then all methods (create(), getBlockLocations(), rename(), etc.) can throw the UnresolvedPathException all the way to the application via the FileSystem API. The application can then call getSymLink() to resolve the mount point.
  UnresolvedPathException should be a regular subclass of IOException without any hints on how the symlink should be resolved.
  I understand that local links are resolved on the name-node.
  So UnresolvedPathException is thrown only when a mount point is crossed. This is uncommon and should not be optimized.
  The application itself should call getSymLink() either after receiving the UnresolvedPathException or if it knows about the mount point from its previous activity.
          2. Another "clean" way to handle this would be to introduce output parameters in RPC framework.
            In current RPC all arguments are input only and the only way to return a value is by explicitly specifying the return value of the protocol method.
            And if you need to return several values you have to wrap them into an object.
            This is exactly what makes FSLink*** so ugly.
            If we had a way to specify that some arguments are output values then the api would more reasonable. E.g.
            public FSLink getFileInfo(String inSrc, FileStatus outStatus) throws IOException;
            

            vs

            public FSLinkFileStatus getFileInfo(String src) throws IOException;
            
          Doug Cutting added a comment -

          > The code looks bizarre (or unclean) with all the FSLink*** return types.

          It does look different than the FS API always has looked, and different than other FS APIs. But to reject it for just looking different, to call it bizarre and unclean, is xenophobia. It is different. Other FSs (e.g., Linux VFS) don't try to do this in a single call, but rather loop, checking whether each element in the path is a link. If we wish to avoid that, then we must pack more data into the return types of our API, and our API will look different. We can try to hack around that by using exceptions or thread locals or output parameters, but those are all just attempts to try to make this API look like other APIs rather than embrace its differences.

          So, do we really want to handle links in a single call? I think so.

          Tsz Wo Nicholas Sze added a comment -

          I think it is wrong to combine FSLink and the original return type like boolean, BlockLocations, etc. to create new classes. The problem here is that we need to return two items in a single method call. The usual approach is to return a container, say FsReturnType<T>. I suggest the following API:

          //T is the original type
          public class FsReturnType<T> {
            public T getValue() {
              ...
            }
          
            Path getLink() throws IOException {
              ...
            }
          }
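
          A hypothetical caller-side loop over such a container, for illustration; the getFileInfo() signature below is an assumption, not part of any patch:

          // Assumed protocol method: FsReturnType<FileStatus> getFileInfo(String src)
          FsReturnType<FileStatus> result = namenode.getFileInfo(src);
          while (result.getLink() != null) {     // a link came back instead of a value
            src = result.getLink().toString();   // follow the link target
            result = namenode.getFileInfo(src);
          }
          FileStatus status = result.getValue();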
          

          BTW, should our naming convention be FsXxx or FSXxx?

          Doug Cutting added a comment -

          > public class FsReturnType<T>

          This looks good to me.

          > BTW, should our naming convention be FsXxx or FSXxx?

          Good question. I prefer FsXxx, but we've not been consistent in Hadoop, nor is Java. I think java.net.URL was a mistake, that it should have been named java.net.Url, that acronyms in class names should be capitalized, not all caps. This makes them easier to parse.

          Tsz Wo Nicholas Sze added a comment - - edited

          Good question. I prefer FsXxx, but we've not been consistent in Hadoop, nor is Java. I think java.net.URL was a mistake, that it should have been named java.net.Url, that acronyms in class names should be capitalized, not all caps. This makes them easier to parse.

          It is time to rename all the FSXxx classes since the public API is going to be changed a lot in 0.20 anyway. I have created HADOOP-4357 for renaming the classes.

          Sanjay Radia added a comment -

          > Let's not call something 'unclean' just because it is more work.

          It has nothing to do with more work – I am always willing to do more work for the right clean solution.
          Konstantin also said it is "bizarre" that all the methods return FSLink. I am not the only one finding this "unclean" or "bizarre".

          There is a fundamental difference of opinion here: several of us are arguing that the use of exceptions is fine for recoverable conditions. It keeps the interface simple and clean.

          BTW Do you find such use of exceptions to be acceptable inside the name node? Both Raghu and I raised this.
          Are you suggesting that the namenode side of the code be changed to not use exceptions?
          If so, you are consistent, and it will be some work to change the internal name node code to use the FSLink-style approach; dhruba's patch does not do this.
          If not, why the inconsistency in when exceptions are appropriate or not appropriate?

          >The interfaces should reflect their function.
          Having create() return FSLink does not help one understand its function.
          Return types like "FSLinkBoolean" and FSLinkBlockLocations confuse the function of the methods in an interface.
          The symbolic link issue cuts across all interfaces that have path names. Overloading the return type of each method that has path-name parameters
          dilutes how well the method signature reflects its function.
          The method signatures and datatypes are easier to understand if we use exceptions (which in the opinion of some of us is acceptable for recoverable conditions).

          > Other FSs (e.g., linux VFS) don't try to do this in a single call, but rather loop, checking whether each element in the path is a link.
          No one is suggesting that we handle it as a single call – with exceptions it will loop in exactly the same fashion. Perhaps you have misunderstood the
          alternate solution being proposed.

          >We can try to hack around that by using exceptions or thread locals or output parameters,
          > but those are all just attempts to try to make this API look
          >like other APIs rather than embrace its differences.
          (Q. By overloading the return type aren't you using output parameters? The return type is an output parameter.)
          Many will consider the overloading of return types hacky as well (and this will get worse when, in the future, the client side needs to perform other recovery functions). Clearly we have a difference of opinion on what is hacky or not.

          Comparing to other file systems' APIs is not fair here for two reasons:
          1) they do not use a programming language that has first class support for exceptions (both checked and unchecked).
          2) Mounts and symbolic links in most systems are handled on the server side - in most FSs, clients cannot process links.
          The only possible exception (no pun intended) is Slash-relative symbolic links in NFS (I don't know if NFS supports slash-relative symbolic links that are kicked back to be relative to the client's view of Slash (root)).

          In the two approaches being discussed in this Jira, both are looping to process the operation across links/mount points. The difference is how the
          condition is communicated back to the caller, and the impact it has on the method signatures and data types used as parameters or return types.
          There is a clear difference of opinion here.
          I am interpreting your referenced URL differently - I am reading it and seeing that using exceptions for recovery is acceptable. Further, following a mount point link is not normal, and hence the use of an exception is acceptable. Joshua Bloch's Effective Java also, as per my reading, advocates the use of checked exceptions for recoverable conditions. You will probably interpret that text differently or disagree with it.

          Sanjay Radia added a comment -

          Dhruba, a question about your patch. Why do the data types LocatedBlocks, FSDataInputStream, FSDataOutputStream
          contain the methods isSymlink() and getSymlink()?
          Or is it for overloading the return type to process the remote symbolic links?
          To be consistent with the rest of your patch, wouldn't it have been better to declare the corresponding FSLinkXXX types for each of the XXX types above?

          Doug Cutting added a comment -

          > BTW Do you find such use of exceptions to be acceptable inside the name node?

          No.

          I thought I made it clear that I had not reviewed the namenode code changes, but only those in the public APIs.

          > No one is suggesting that we handle it as a single call

          Linux's VFS does not handle both open and isLink in a single call, the way we'd like to.

          > I am interpreting your reference URL link differently - I am reading it and seeing that using exceptions for recovery is acceptable.

          That's a separate issue: recovery from an unexpected condition outside the control of the code in question. That's not the case here. Links are an expected, normal condition.

          Konstantin Shvachko added a comment -

          > public class FsReturnType<T>

          Interesting idea. You would rather have something like

          public class FSReturnType<T extends Writable> implements Writable {
            public T getValue();
            Path getLink() throws IOException;
          }
          

          And you would have to implement serialization methods, and think about how to define a WritableFactory for the parameterized class.
          And then the protocol methods will look like this

          public FSReturnType<BooleanWritable> setReplication(String src,...); 
          

          This is better than creating a new class for every new return type, but seems rather complex compared to one new protocol method plus the UnresolvedPathException that I was proposing.
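
          A rough sketch of the serialization burden alluded to above; the value/link fields and the null-link convention are assumptions for illustration, not from any patch:

          import java.io.DataInput;
          import java.io.DataOutput;
          import java.io.IOException;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.io.Writable;

          public class FSReturnType<T extends Writable> implements Writable {
            private T value;
            private String link;                   // null when no link was crossed

            public T getValue() { return value; }

            public Path getLink() throws IOException {
              return link == null ? null : new Path(link);
            }

            public void write(DataOutput out) throws IOException {
              out.writeBoolean(link != null);
              if (link != null) {
                Text.writeString(out, link);
              } else {
                value.write(out);
              }
            }

            public void readFields(DataInput in) throws IOException {
              if (in.readBoolean()) {
                link = Text.readString(in);
              } else {
                value.readFields(in);              // value must be pre-created, e.g. by a factory
              }
            }
          }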

          Sanjay Radia added a comment -

          >> Are you saying that getSymlink() can return file:///etc/dir/file.txt?

          >Yes.

          Can the name node always know how to compose the two parts? I recall something bizarre about composing archive filesystem paths.
          If there is a well-defined way of composing the two parts then I am fine with the alternative proposed above.
          If not, the two parts may have to be passed to the file system.

          Sanjay Radia added a comment -

          Dhruba says:
          > I vote for using exceptions internal to the NameNode to keep internal interfaces clean and elegant.
          I agree with you that using exceptions keeps the NN interfaces clean and elegant.
          I further agree with Raghu that we need to be consistent here and extend your taste for "clean and elegant" to the RPC interface.

          Owen O'Malley added a comment - - edited

          My preference is toward not using exceptions for nominal cases, and I agree with Doug that links seem like nominal cases. Clearly, however, it isn't a slam dunk.

          At the risk of crossing streams, I think that classes like FSLinkBoolean are problematic. I think we are better off creating a type for each method's return type. I think that a lot of these return types should implement a class/interface (Owen ducks) that provides the common interface for dereferencing the links.

          The reason for having

          RenameResult rename(Path src, Path dst) throws IOException;
          versus
          FSLinkBoolean rename(Path src, Path dst) throws IOException;
          

          would be that if we want our protocol to be future-proof, rename may well need to add additional return values. If we have a bunch of methods all returning the same type, that is hard.

          Thoughts?
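
          One possible shape for this suggestion, sketched under the assumption of a shared link-bearing interface; every name below is illustrative, not from any attached patch:

          import org.apache.hadoop.fs.Path;

          // A per-method result type implementing a common interface for
          // dereferencing links, as suggested above. All names are assumptions.
          interface LinkedResult {
            boolean isLink();          // true if an unresolved link was encountered
            Path getLink();            // the link to follow, or null
          }

          class RenameResult implements LinkedResult {
            private final boolean renamed;   // the original boolean return value
            private final Path link;         // non-null when a link must be resolved

            RenameResult(boolean renamed, Path link) {
              this.renamed = renamed;
              this.link = link;
            }

            public boolean isLink() { return link != null; }
            public Path getLink() { return link; }
            public boolean isRenamed() { return renamed; }
          }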

          Raghu Angadi added a comment -

          Doug wrote:
          > Exceptions are intended for cases where you're not sure who will catch them. [...]

          UnresolvedLinkException seems like such a case: when it is thrown, we don't know whether it will be handled inside the NameNode, or by DFS or FS on the namenode... it might be the same with other FS implementations too.

          Sanjay Radia added a comment -

          Doug says:
          >> [minor] the method next() suggests that one follows the path to the next symbolic link. It does not strictly.

          >Can you elaborate? Each time this is called a link is resolved, returning a path that may contain more links, following a chain of links, no? Also, >'doOperation()' is next to meaningless. So if 'next' is indeed inappropriate we should replace it with something more specific, not something generic.

          First I had qualified this as minor (meaning it is just a minor suggestion for the author).
          When I read the code, I first thought that dhruba was processing all sym links and then doing the actual operation at the end.
          Turns out next() actually performs the operation. If the operation succeeds (and it will for most pathnames, including all dot-relative symlinks)
          all is well and done. Indeed, if there is another NN in a chain of links that is willing to process remote symlinks internally, all is fine and done.
          If the operation fails, then the recovery is performed.
          Now the design is quite elegant. Later we can add new reasons why the NN could not complete the operation, and the structure could be modified
          to recover from new conditions. I like that. The name next() suggests that one is following symlinks, so I was suggesting something like doOperation() or
          performOperation(), which I find more accurate than the generic "next()".
          Perhaps adding a comment to the abstract method may be sufficient. I am okay with whatever the author decides here - it is a very minor comment.

          Sanjay Radia added a comment -

          Owen says:
          >My preference is toward not using exceptions for nominal cases, and I agree with Doug that links seem like nominal cases. Clearly, however, it isn't a slam dunk.
          >At the fear of crossing streams, I think that the classes like FSLinkBoolean are problematic. I think we are better off creating a type for each method's return type. I think that a lot of these return types should implement a class/interface (Owen ducks) that provides the common interface for dereferencing the links.

          What you lose with Owen's class/interface suggestion is that the signature no longer indicates
          that the method returns a boolean, an array of LocatedBlocks, a file descriptor, etc.
          I think Dhruba was trying to preserve that aspect of the original interfaces, whereby reading the name of the method and the names and types of the
          input and output parameters, one can easily guess the semantics of the method.
          (The eraser I threw at Owen bounced off the screen.)
          But Owen is right in that his approach is a little more future-proof (and exceptions are more future-proof still).

          Owen, what do you propose for the internal interfaces in the name node? – do you suggest that we remove the symlink-related exception and use your structure everywhere in the FS and HDFS code?
          If not, why the inconsistency?

          Konstantin Shvachko added a comment -

          Owen is proposing a reasonable rename of the FSLink*** classes. But the implementations of these classes remain the same.
          Going further this way we could make the API look like this:

          RenameReply rename(RenameRequest arg) throws IOException;
          

          This is universal, but not a transparent API anymore.
          The next step to completely generalize (and exaggerate) this would be to declare one single protocol method.

          Reply execute(enum opCode, Request arg) throws IOException;
          

          Then the API will have to be defined by a pair of Request/Reply classes.

          dhruba borthakur added a comment -

          1. "symlink" vs "link":
          I think it makes sense to call the current implementation "links" instead of "symlinks". None of the existing file system implementations has any support for any kind of links. It is OK for the first implementation to refer to this new construct as a generic "link". HDFS implements it as a symbolic link, but some other file system may implement "links" as hard links.

          +1 for calling this construct a "link" instead of a "symlink".

          2. Exceptions vs Objects-as-return-status in an public API (FileSystem or ClientProtocol API)
          Exceptions and objects-as-return-values are two ways of communicating a certain piece of information to the user of the API.
          (a) One goal is to discuss how we can attempt to make that API somewhat future-proof. If we consider our current Hadoop RPC, the only way to serialize/deserialize an exception object is to serialize its message string. The client side can deserialize this string and reconstruct an exception object. If the return status needs to contain various different pieces of information, then serializing/deserializing it as a string inside the exception object is not very elegant. Many other RPC systems (e.g. Thrift) allow versioning objects (adding new fields) but many might not allow adding new exceptions to a pre-existing method call. Thus, making API calls return objects-as-return-types seems to be more future-proof than adding exceptions.
          (b) The rule of thumb that we have been following is that exceptions are generated when an abnormal situation occurs. If an exception is thrown by the ClientProtocol, it is logged by the RPC subsystem into an error log. This is a good characteristic to have in a distributed system; it makes debugging easy because a scan of the error logs
          pinpoints the exceptions raised by the API. Access control checks or disk-full conditions raise exceptions, and they are logged by the RPC subsystem. We do not want every call that traverses a symbolic link to log an exception message in the error logs, do we? (Of course, we can special-case it and say that we will not log UnresolvedPathException; but by
          special-casing it, we are acknowledging that this exception is not abnormal behaviour.)

          +1 for RenameResult rename(Path src, Path dst) throws IOException;

          3. Exceptions vs Object-as-return-status inside the NameNode
          (a) Different filesystems can have very different implementations and very different sets of developers. For example, HDFS might implement code in such a way that traversing a link returns an object status, whereas S3 or KFS throws an exception (internal to the implementation). If we write a file system implementation for Ceph, we are unlikely to rewrite the Ceph code to not use exceptions (or vice versa). I would like to draw the distinction that this issue is not related to what is decided in case (2) above.
          (b) The primary focus in designing the internal methods of the NameNode is not future-proofing for backward compatibility. Also, there isn't any requirement to serialize/deserialize any exception objects as long as the object is used inside the NameNode. Thus, exceptions could be used here. This keeps most of the HDFS code clean and elegant.

          +1 for Using Exceptions inside the NameNode internal methods.

          Doug Cutting added a comment -

          Dhruba> the NameNode is not future-proof for backward compatibility.
          Dhruba> Thus, exceptions could be used here. This keeps most of the HDFS code clean and elegant.

          Yes, they could be used here, and it could be fixed later without breaking user code, but that doesn't make it clean and elegant. Rather it is a quick and dirty hack. Returning normal values through exceptions is a perversion. The Namenode should be revised to support links as first-class events, not as exceptions.

          Konstantin> This is universal, but not a transparent api anymore.

          It's no less transparent than 'FileStatus getFileStatus(Path)'. In that case we combined the value of many methods (isDirectory(), size(), owner(), etc.) into the value of a single method with a very generic name.

          In this case we wish to collapse three logical calls into one. We want to perform 'isLink()', and then either 'getLink()' or, e.g., 'open()', and return sufficient information so that the caller can tell what happened. In such cases, when multiple values must be grouped together to represent something, we use a data structure, not some other channel like exceptions or global variables. With a data structure comes some loss of transparency, also called abstraction.

          Sanjay Radia added a comment -

          Sanjay says:
          >> Are you saying that getSymlink() can return file:///etc/dir/file.txt?

          >Yes.

          >Can the name node always know how to compose the two parts? I recall something bizarre about composing archive filesystem paths.
          >If there is a well-defined way of composing the two parts then I am fine with the alternative proposed above.
          >If not, the two parts may have to be passed to the file system.

          I checked - the above scheme will work with archives. It uses the standard separator.
          +1 on the getSymlink returning the composed pathname.

          Sanjay Radia added a comment -

          Dhruba says:
          >If an exception is thrown by the ClientProtocol, it is logged by the RPC subsystem into an error log.
          >This is a good characteristic to have in a distributed system, makes debugging easy because a scan in the error logs
          >pinpoints the exceptions raised by the API.
          >Access control checks or disk-full conditions raise exceptions, and they are logged by the RPC subsystem. We do not want every call to "traverse a
          >symbolic link" to log an exception message in the error logs, do we? (Of course, we can special case it and say that we will not log >UnresolvedPathException; but by
          >special-casing it, we are acknowledging that this exception is not an abnormal behaviour).

          The NN (or any server) should log server errors (like disk full) but not user exceptions (like a non-existent path name being supplied).
          The reason that access violation exceptions are logged is that they are needed for security audits - security is a special case, and the info may even be logged in a separate file.
          Hence, in general, the server side needs to filter what is logged.
          If we had some inheritance hierarchy in our exceptions we could make the filter simpler.
          The special-casing argument is not valid since we should add some sort of a filter anyway at some point.

          Sanjay Radia added a comment -

          Symlink vs link - I am indifferent and can go either way.

          One advantage of Owen's proposal over the current patch: it will help with versioning in the future as each method can evolve its return type
          independently of the return types of other methods.

          Dhruba> Thus, exceptions could be used here. This keeps most of the HDFS code clean and elegant.

          Doug> Yes, they could be used here, and it could be fixed later without breaking user code, but that doesn't make it clean and elegant.
          > Rather it is a quick and dirty hack. Returning normal values through exceptions is a perversion.
          > The Namenode should be revised to support links as first-class events, not as exceptions.

          It is highly unlikely to get "fixed" later and over time it will get harder and harder to "fix".
          Doug should we pay the cost and do it the "right way" now?
          (BTW I prefer to use exceptions and don't agree with Doug's view; hence my use of quotes above).

          Doug Cutting added a comment -

          > Doug should we pay the cost and do it the "right way" now?

          Yes, of course.

          > I prefer to use exceptions [ ... ]

          But why? You've said it's "cleaner" and "simpler". I don't follow your argument. "Simpler" could mean fewer lines of code. I don't think it will actually change the lines of code much. Changing the return types will change more lines of code, hence the change is more complex, but the resulting code will probably be around the same size, regardless of which approach is taken. "Cleaner" is subjective, like "better", and requires elaboration.

          My argument is that we should model normal functionality with normal programming: methods, parameters, return values and data structures. Do you disagree with this? Or do you think that links are not normal functionality?

          You keep arguing that "exceptions can be used for recoverable conditions". Indeed they can. But is a link really best modeled as a "recoverable condition"? This sounds to me like a rationalization for using exceptions, not a reason to use exceptions over data structures.

          Please provide a reasoned explanation for your disagreement. Otherwise it is hard to move forward.

          Konstantin Shvachko added a comment -

          Dhruba> In current Hadoop RPC, the only way to serialize/deserialize an exception object is to serialize its message string.

          I am not sure my proposal was understood completely.

          • I have an exception XMountPointException, which does not encode the path (or any special fields) in its message.
          • If I call getFileStatus(src) and src crosses a mount point the XMountPointException is thrown by the name-node, which is passed through RPC to the application level.
          • The application catches XMountPointException and calls linkSrc = getSymLink(src), which returns the first link on the path.
          • The application then calls getFileStatus(linkSrc).
          • Encoding linkSrc inside XMountPointException would be an optimization: unnecessary here because crossing a mount point
            is not common.
          • If an application knows and expects to cross a mount point it can call getSymLink(), avoiding catching XMountPointException.

          Is this the right usage of exceptions?
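
          In code form, assuming the names above (essentially the loop sketched earlier in this thread; namenode is a hypothetical handle):

          // Condensed sketch of the proposed pattern; XMountPointException and
          // getSymLink() are proposed names, not in any attached patch.
          String path = src;
          FileStatus stat = null;
          while (stat == null) {
            try {
              stat = namenode.getFileStatus(path);
            } catch (XMountPointException e) {
              path = namenode.getSymLink(path);   // first link on the path; retry with it
            }
          }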

          Doug> we combined the value of many methods into the value of a single method with a very generic name.

          The point of RPC is to make transparent the types passed and returned by a method.
          As said, Object execute(Object request) replaces all RPCs by one generic call.
          It is like sending and receiving blob buffers and casting them into appropriate structures.
          That is bad practice in C, unavoidable in Cobol, and cannot qualify as a remote procedure call.

          Raghu Angadi added a comment -

          > Returning normal values through exceptions is a perversion.
          Hardly anyone disagrees with that. Another truism is that returning unrelated values is API pollution. I guess one end of the opinion spectrum is that returning a link is normal, and the other end thinks it isn't. The issue is not about returning values vs. exceptions 'in general', but only about this particular case.

          Doug Cutting added a comment -

          > Another trueism is that returning unrelated values is API pollution.

          Sure, but the values here are not unrelated. They're core.

          > I guess one end of the opinion spectrum is that returning a link is normal, and the other end thinks it isn't.

          Okay, so, if that's the case, how is it abnormal? The traditional way to model this would be something like:

          public FileHandle open(Path path) {
            for (Path element : path) {          // pseudocode: walk each element of the path
              FileStatus stat = fsImpl.getStatus(element);
              if (stat.isLink()) {
                return open(stat.getLinkTarget());
              } else if (stat.isFile()) {
                return fsImpl.open(element);
              }
            }
          }
          

          I don't think anyone would argue that the values of isLink() and getLinkTarget() are not normal data. But we don't want to make this many calls to our FS implementations, since they're RPCs. So, for performance, we want to package that same data in a different way, with something like:

          public FileHandle open(Path path) {
              LinkOrFileHandle linkOrFileHandle = getNextLinkOrOpen(path);
              while (linkOrFileHandle.isLink()) {
                 linkOrFileHandle = getNextLinkOrOpen(linkOrFileHandle.getLink());
              }
              return linkOrFileHandle.getFileHandle();
          }
          

          Why does this suddenly make that same data non-normal?

          Raghu Angadi added a comment -

          > I don't think anyone would argue that the values of isLink() and getLinkTarget()

          Yes. It is of course proper data for 'stat' or FileStatus.

          > But we don't want to make this many calls to our FS implementations, since they're RPCs. So, for performance,
          So... this is a performance compromise. But the arguments till now have been about a "perfect" API/implementation for a clearly imperfect/compromise (but well-performing) solution.

          > Why does this suddenly make that same data non-normal?

          In the patch it is not called 'nextLinkOrOpenImpl()'... it is just called 'openImpl()'.

          Raghu Angadi added a comment -

          Let's take another hypothetical case: our RPC gets rewritten, and for some reason it wants to inform callers to retry. Will all the RPC calls have a prefix 'retryOr...()', and will all the returned values have an extra field to indicate whether the call needs to be retried? This is obviously an exaggerated example but hopefully makes a point.

          Also I realize that it is hard to be too objective about interfaces...

          Doug Cutting added a comment -

          > so.. this is a performance compromise. But the arguments till now have been about "perfect" ...

          This issue has long been performance-related. The assumption is that it is unacceptable for the addition of symlinks to double (or worse) the number of RPCs for common HDFS operations when no links are present. I have noted this assumption on several occasions and no one has disputed it. Do you agree with this assumption, or do you feel we should examine it more?

          > in the patch it is not called 'nextLinkOrOpenImpl()'... it is just called 'openImpl()'.

          I thought the point of contention is not the name of the method or return value but whether to use an exception or a data structure to pass the return value. (That pseudo code also referred to FileHandle, which is also not in the patch.) Would you like to propose better names for the methods and objects in the patch? I am not wedded to the names used in the patch, but perhaps we should resolve the exception-related issue first?

          > This is obviously an exaggerated example but hopefully makes a point.

          I'm not sure what the point is. If the RPC system wished to return some generic data for all protocols then it could not force a different return type, since return types are protocol-specific, so we'd probably need to add a method to the RPC runtime. If the situation is unusual, then an exception would be more appropriate.

          HTTP is perhaps an analogy. It uses the return code to indicate a redirect. URLConnection#getInputStream() doesn't throw an exception when it encounters a redirect. It either follows the redirect or returns the text of the redirect page, depending on whether you've set HttpURLConnection#setFollowRedirects. Redirects are part of the protocol. The HTTP return type is complex. That's the model we must move towards if we wish to handle links without adding lots more RPCs.
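
          For concreteness, the analogy in code, using only the standard java.net API (this is not from any patch):

          // A redirect is data in the reply, not an exception; the caller loops,
          // just as a link-following client would.
          static java.io.InputStream openFollowing(java.net.URL url) throws java.io.IOException {
            while (true) {
              java.net.HttpURLConnection conn = (java.net.HttpURLConnection) url.openConnection();
              conn.setInstanceFollowRedirects(false);          // handle redirects ourselves
              int code = conn.getResponseCode();
              if (code != java.net.HttpURLConnection.HTTP_MOVED_PERM
                  && code != java.net.HttpURLConnection.HTTP_MOVED_TEMP) {
                return conn.getInputStream();                  // a normal reply
              }
              url = new java.net.URL(conn.getHeaderField("Location"));  // follow the "link"
            }
          }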

          Raghu Angadi added a comment -

          > Do you agree with this assumption, or do you feel we should examine it more?

          I agree with the assumption. I was only wondering why we are ok with a compromise in one aspect but not in the other.

          > I thought the point of contention is not the name of the method or return value but whether to use an exception or a data structure to pass the return value. (That pseudo code also referred to FileHandle, which is also not in the patch.)

          For me at least, the issue was returning something that does not suit the API. If we have a member called 'getNextLinkOrOpen()' that returns the next link or opens a file, it makes perfect sense. My objection has been to calling it open() and having it sometimes return a link.

          Though I don't like the new names, +1 for the current patch if the names of the members are changed to reflect what they do. This also allows HDFS to use exceptions, since we don't want to change the names.

          Doug Cutting added a comment -

          > I was only wondering why we are ok with compromise in one aspect and not about the other.

          What compromise is this?

          > it makes perfect sense. My objection has been calling it open() and returning link some times.

          So it was all about the method name? Okay, then please suggest new method names and return type names, rather than suggesting we use exceptions to return normal data. I personally find the 'xOry' names useful in pseudo code, when I'm trying to make a point, but a bit verbose in real code. We don't call FileStatus NameAndModifiedDateAndOwnerAnd....

          > This also allows HDFS to use exceptions since we don't want to change the names.

          What? I don't follow your logic here. Why can't HDFS change method names?

          Raghu Angadi added a comment -

          >[...] but a bit verbose in real code.

          So is replacing 'return false;' with 'return new FSLinkBoolean(false)'.

          And as developers we have to look at it and deal with it every day.

          Raghu Angadi added a comment -

          > Why can't HDFS change method names?

          Right. They can.

          Owen O'Malley added a comment -

          I don't think it is reasonable at all to use an extra RPC to always test for the existence of links.

          To me it boils down to three things:
          1. Hitting a link in a traversal isn't an error or an exceptional situation.
          2. Exceptions are not handled well by rpc.
          3. The logging by the rpc package is confusing for non-system exceptions.

          Most of the debate seems to be about #1, but it is almost entirely a style issue. You either think it is an appropriate use of exceptions or not. I haven't seen anyone switch sides based on the arguments on this jira.

          On #2, currently all exceptions are stringified and sent as RemoteException. That isn't very useful for throwing and catching. If we switch to Thrift in the future it would, again, be unclear what the exception semantics are. Return types are well understood and, under Thrift, are versioned, which would be a very good thing.
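
          (For illustration, a sketch - not from the patch - of what #2 means for a caller today; it assumes RemoteException's unwrapRemoteException helper, reuses the UnresolvedPathException name mentioned elsewhere in this thread, and namenode/src are stand-ins:)

          try {
            namenode.open(src); // hypothetical ClientProtocol call
          } catch (org.apache.hadoop.ipc.RemoteException re) {
            // The server-side exception crossed the wire stringified; recovering the
            // typed exception relies on matching the remote class name, which is brittle.
            IOException ioe = re.unwrapRemoteException(UnresolvedPathException.class);
            if (ioe instanceof UnresolvedPathException) {
              // follow the link ...
            } else {
              throw ioe;
            }
          }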

          On #3, we can probably use minor changes to RPC to avoid it. (A marker interface that says not to log it when thrown across RPC? That would still be awkward for the Java exceptions that we are re-using.)

          Doug Cutting added a comment -

          > So is replacing 'return false;' with 'return new FSLinkBoolean(false)'.

          That's not a fair comparison. I was talking about shorter names versus longer names. You are talking about throwing exceptions versus adding a bit of code to return a value. Both involve more or less code, but in one case that's the only issue; in the other there's a lot more to it.

          Raghu Angadi added a comment -

          I didn't mean to say the implementation was wrong. I meant that the implementation is a result of the interface.

          Konstantin Shvachko added a comment -

          Doug> whether to use an exception or a data structure to pass the return value.

          !!! No need to pass the return value with the exception !!!

          public FileHandle open(Path path) throws IOException {
            while (path != null) {
              try {
                // fs is the underlying implementation (e.g. the client RPC);
                // succeeds if there are no links in the path
                return fs.open(path);
              } catch (XMountPointException eX) { // thrown only if a mount point is crossed
                path = getLinkTarget(path); // one extra RPC, only when a link was hit
              }
            }
            throw new FileNotFoundException("dangling link");
          }

          There is no need to change the API with this; only add a new method, getLinkTarget().
          And XMountPointException does not encode or pass any values.

          Sanjay Radia added a comment -

          Doug says>>>>>>
          >> Doug should we pay the cost and do it the "right way" now?
          >Yes, of course.
          >> I prefer to use exceptions [ ... ]
          >But why? You've said it's "cleaner" and "simpler". I don't follow your argument. "Simpler" could mean fewer lines of code. I don't think it will actually change the lines of code much. Changing the return types will change more lines of code, hence the change is more complex, but the resulting code will probably be around the same size, regardless of which approach is taken. "Cleaner" is subjective, like "better", and requires elaboration.
          >My argument is that we should model normal functionality with normal programming: methods, parameters, return values and data structures. Do you disagree with this? Or do you think that links are not normal functionality?
          >You keep arguing that "exceptions can be used for recoverable conditions". Indeed they can. But is a link really best modeled as a "recoverable condition"? This sounds to me like a rationalization for using exceptions, not a reason to use exceptions over data structures.
          >Please provide a reasoned explanation for your disagreement. Otherwise it is hard to move forward.

          Two-part answer below: (1) a response to the above, and (2) in the spirit of moving forward.

          This is a very one-sided evaluation of the discussion that has occurred above.
          First, I have said that we have a difference of opinion (essentially granting you that you have a reasonable point of view).
          I have not come out and said that you are not offering any reasonable arguments.
          Have reasonable arguments been offered for my side? Yes; you are just not willing to see them because you think the use of exceptions in this situation is blasphemous.
          I am not the only one who finds the use of exceptions clean and elegant and the alternate view not so; there are others who have said similar things.

          • Konstantin
          • Raghu
          • Dhruba - "Dhruba> Thus, exceptions could be used here [NN]. This keeps most of the HDFS code clean and elegant. " He however argues that we should not use exceptions for RPC. But he does state that use of exceptions make the NN clean and elegant.

          Hence many have provided counter-arguments to your view.
          Others and I have tried to explain what we mean by "elegant and simple".
          Since several of us are saying the same thing, please grant us that we are seeing something you are not, even if we are all wrong.
          I agree that terms like simple and elegant are not as precise as one would like, but then neither are "hacky" or "perversion".

          What we have is a difference of opinion - both of us are convinced that the other approach is not good or is invalid. (BTW I don't believe your approach is invalid and wrong, I just prefer mine.)

          In the spirit of moving forward:

          • symlink vs link - I am okay with either but prefer symlink.
          • use of exceptions inside NN
            +1.
            I believe this keeps the code and the internal interfaces "simple and elegant" (others such as Dhruba have said the same).
            It has nothing to do with how much code needs to get changed.
            Let's agree to disagree here.
            Doug, if you want to change the NN code to not use exceptions, I would like an opportunity to sit with you and make one last attempt
            to help you see my point of view, or get convinced by yours. Either way, we can discuss this over a beer someday (I will buy).
          • Not using exceptions for RPC interface - use the patch approach or Owen's variation on it.
            I can live with this.
            I can justify not using exceptions here to some degree because the impact is limited, and also because some RPC systems do not have exceptions (not a very strong argument, since we already use exceptions in our RPC interface).
            I am okay with the way the patch has done it or the alternative that Owen has offered.
            (It's six of one, half a dozen of the other.) As I noted, Owen's approach is potentially better for evolution and versioning.
          • Having the NN compose the symlink and the rest of the path
            I can live with either - I prefer to keep them separate and have the client side do the composition, as information is preserved; furthermore, it allows us to deal with other ways of composing the two parts (though I have no concrete examples of needing this at this stage).
          Sanjay Radia added a comment -

          Oops a few of my comments got deleted by mistake as I was doing cut/paste.

          -------------

          >My argument is that we should model normal functionality with normal programming: methods, parameters, return values and data structures. Do you disagree with this? Or do you think that links are not normal functionality?
          >You keep arguing that "exceptions can be used for recoverable conditions". Indeed they can. But is a link really best modeled as a "recoverable condition"? This sounds to me like a rationalization for using exceptions, not a reason to use exceptions over data structures.

          The way you have worded your questions makes it very hard for me to convey the alternative.
          So I respond in the best way I can (and you will not find the answer acceptable, since I have said some of this before).
          I see following a symbolic link as an exceptional condition, but not an error. The NN prefers not to follow remote symlinks.
          It is, however, a condition that can be recovered from by the client side.
          Either the use of exceptions or the use of data types and parameters is a way to deal with this.
          I believe the use of exceptions in this situation is acceptable and, further, the better of the two alternatives.
          Further, I believe it leaves the method signatures intact, so they continue to convey the function they previously did by their names and by the types of their in and out parameters. I like that property of the solution. The alternate solution forces me to create a whole bunch of new data types or overload existing ones. If exceptions were not available I would use the alternate solution.
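
          (For illustration, a sketch - not from the patch - of the two shapes being debated; XMountPointException comes from this thread's proposal, while SpiSketch and OpenResult are made-up names:)

          interface SpiSketch {
            // Alternative A (exceptions): the signature stays intact; a link
            // surfaces as a thrown condition caught at the FileSystem layer.
            FileHandle open(Path f) throws IOException, XMountPointException;

            // Alternative B (return types): the signature is widened so either
            // outcome is ordinary data; OpenResult would carry either the
            // FileHandle or the link-target Path.
            OpenResult openImpl(Path f) throws IOException;
          }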

          I don't like your use of the term "normal control flow". That term was used in one of your references for an example where an out-of-bounds exception was used to get out of a for loop at the array index boundary. Clearly, using exceptions to exit a for loop on out-of-bounds is blasphemous.
          So the issue is that if I am forced to argue using the words normal or not normal, it makes my point of view look idiotic - the use of those words in that example does not apply in this context.

          ----------------

          Doug Cutting added a comment -

          Stepping back a bit, I don't believe this dispute is fundamentally about programming style. Style, like Owen's arguments about RPCs, is a supporting argument. The primary issue is that we're changing the FileSystem API in a fundamental way. Instead of direct functions that map parameters to values, we wish to implement conditional functions that do one of two things and indicate in their response which path was taken. This is a very different style, one that is forced on us for performance reasons. The alternative is to double or more the number of RPCs per file operation, which is unacceptable.

          If we're using a different style, then it should look different. It should not look like the direct mapping of the existing API. We should not attempt to hide this, as that will lead to confusion. We should make it apparent to readers of the code that these methods return not a single type of value, but one of two kinds of values. This is not a matter of exception style, it's a matter of expressiveness and readability. Code should make clear what it does. Style guidelines derive from this. The reason exceptions are discouraged for normal control flow is that they hide control flow. We are factoring our FileSystem API in a different way, and that factoring should be apparent to all as an obvious, natural attribute of the API.

          Sanjay Radia added a comment -

          Assuming that the vote for not using exceptions for the RPC interface passes (I have voted +0), should we consider modifying the RPC interface a little further (see the sketch after this list)?

          • Return all HDFS exceptions as part of the return status rather than as exceptions; the only exception thrown would be the IOException when the RPC itself fails for some reason.
          • This would make the approach consistent - all status and information returned from the NN is part of the XXXResult object for each method XXX.
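
          (For illustration, a sketch - not part of any patch - of what that convention might look like; MkdirsResult, its fields, and ClientProtocolSketch are made-up names:)

          // One result object per RPC method; failures travel as data, not as throws.
          class MkdirsResult {
            boolean success;     // the method's ordinary return value
            String errorClass;   // e.g. "AccessControlException", or null if none
            String errorMessage; // server-side detail, or null
          }

          interface ClientProtocolSketch {
            // IOException is thrown only when the RPC itself fails.
            MkdirsResult mkdirs(String src) throws IOException;
          }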
          Konstantin Shvachko added a comment -

          Doug> Instead of direct functions that map parameters to values, we wish to implement conditional functions, that do one of two things, and indicate in their response which path was taken.

          My proposal allows us to keep the code within the current API style: "direct functions that map parameters to values".
          HADOOP-4044#action_12637702
          HADOOP-4044#action_12637610
          HADOOP-4044#action_12637240

          Doug> The alternative is to double or more the number of RPCs per file operation, which is unacceptable.

          My proposal does "not double the number of RPCs per file operation".
          HADOOP-4044#action_12637702

          So if this dispute is not about programming style, then why don't we accept the solution that does not require any fundamental changes to the FileSystem API and does not have performance penalties for the regular case? Or is it nevertheless about the programming style?
          I keep wondering what your selection criteria are for ignoring some arguments and answering others.

          Doug Cutting added a comment -

          > why don't we accept the solution that does not require any fundamental changes to the FileSystem API

          But it does change the API fundamentally. All of these methods are now conditional. They now either return the declared value or a boolean indicating that the declared value is to be ignored and that one should follow up with a request for the link value. That's a substantial change that shouldn't be cloaked in exception handling.

          We might instead consider changing the contract of these methods to permit the return of a null value, and interpret a null value as the need to request the link value. Unfortunately that won't work with the boolean methods, and it's a fragile approach, since folks might forget to check for null. It's better to make it explicit with a data structure.
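
          (For illustration, a sketch - NullContractSketch and the signatures are only stand-ins - of why the null contract works, fragilely, for reference types but not for the boolean methods:)

          interface NullContractSketch {
            // Feasible, if fragile, for reference return types:
            FileStatus getFileStatus(Path f) throws IOException; // null would mean "hit a link"

            // Not feasible for primitives: true and false both already mean something,
            // so there is no spare value left to signal "follow the link".
            boolean exists(Path f) throws IOException;
          }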

          > My proposal does "not double the number of RPCs per file operation".

          I did not mean to imply that it did. I'm sorry that you thought I did. When I referred to doubling the number of RPCs I was contrasting with the naive approach of calling resolveLink() or somesuch explicitly before open(), so that open() is never passed a path that contains a link, as this is implemented in Unix and other VFSs.

          Your approach would add an RPC whenever a symbolic link is encountered over the other approaches, since the link value would not be returned directly with the call to open() but would require a second RPC, right?

          > I keep wondering what is your selective criteria for ignoring some arguments and answering other.

          What arguments am I ignoring? I am sorry if you feel I've ignored some of your arguments. Please highlight them, if you like.

          Konstantin Shvachko added a comment -

          > All of these methods are now conditional. They now either return the declared value or a boolean

          To make sure we are on the same page - open() does not literally return a boolean; you mean that throwing an exception should be treated as a (boolean) indicator of whether the link should be retrieved, right?
          But the methods have always been conditional in that sense, because they are throwing AccessControlException, QuotaExceededException, and FileNotFoundException. And XMountPointException is just another exception of the same nature.
          I do not see it as a fundamental change to the FileSystem API.

          > When I referred to doubling the number of RPCs I was contrasting with the naive approach

          I think everybody agrees that the naive approach is unacceptable.

          > the link value would not be returned directly with the call to open() but would require a second RPC, right?

          Correct. The point is that there is no degradation of performance in the regular case, which is when there are no links or the link is local, but there is an extra RPC when the link points to another cluster.
          And I do not think we should optimize that away, exactly because doing so leads to the fundamental change of the API, just as you describe.
          And because an application can avoid this inefficiency by directly resolving the link before calling open(), if it knows about the mount point.

          Konstantin Shvachko added a comment -

          So the question is why this approach is not acceptable to you (if it is not).
          Is it because the programming style is not good, or is it the performance, or is it something else?
          I am sorry that I could not understand that from your previous comments.

          Owen O'Malley added a comment -

          Doug,
          I don't see your point. Clearly this is not black and white at all, or it wouldn't have caused so much disagreement among the developers. I still think the core disagreement is whether hitting a link is enough of an exception to be reasonably implemented as a Java exception. Most of the rest of it I consider supporting arguments. :)

          In particular, we aren't changing the FileSystem API at all. No client code should be able to detect the difference between the versions before and after. What is changing is the RPC layer and I agree that we effectively have a tagged union between the link result and the actual result. I still don't think either is very nice looking, which is why I was falling back to supporting arguments.

          I think defining result types will give us significant advantages going forward.
          1. It is easy to add additional information in a forward/backward compatible manner.
          2. It is declared and strongly type checked.
          3. It doesn't throw exceptions. :D
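
          (For illustration, a sketch of such a tagged union - LinkedResult and its members are made-up names; compare the LinkResult sketch at the end of this thread:)

          // Exactly one of 'value' and 'link' is set; new fields could later be
          // added in a forward/backward compatible way under a versioned IDL.
          class LinkedResult<T> {
            private final T value;   // set when no symlink was hit
            private final Path link; // set when traversal stopped at a symlink

            LinkedResult(T value, Path link) { this.value = value; this.link = link; }

            boolean isLink() { return link != null; }
            T get()          { return value; }
            Path getLink()   { return link; }
          }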

          dhruba borthakur added a comment -

          From my perspective, I am not buying (yet) Doug's argument that not throwing exceptions inside the namenode is a better programming style. If there is a proposal to change all internal namenode methods to return object return types instead of throwing UnresolvedPathException, then I am against it.

          However, for the RPC interfaces, I like Doug's programming style of returning object-as-return-types rather than throwing exceptions, especially in the case of symbolic links.

          Doug Cutting added a comment -

          Owen> I am not buying (yet) Doug's argument that not throwing exceptions inside the namenode is a better programming style.

          One could in theory use exceptions to handle any condition. For example, we could, if you attempt to open a directory, return the directory listing in an exception. But, rather, we prefer to use data structures unless exceptions are clearly more appropriate. So, the question is, when are exceptions appropriate?

          A primary advantage of exceptions is their long-distance throwing. The maxim is "throw early, catch late", since in most cases it's better to pass exceptions through and let the highest layers deal with them. So one sign that exceptions are appropriate is when you don't know who should handle them. That is not the case here. There's only one place where this is intended to be caught: in boilerplate methods of FileSystem.java.

          In http://www.onjava.com/pub/a/onjava/2003/11/19/exceptions.html they list just three cases where exceptions are appropriate: programming errors, client code errors and resource failures. Links are clearly not programming errors or client code errors. But are they resource failures? Examples of resource failures are things like network timeouts, out of memory, lack of permission, etc. These generally concern resources that are out of the direct control of the application, and hence present unexpected outcomes. Reinforcing that interpretation, one finds sentences like "This is a misuse of the idea of exceptions, which are meant only for defects or for items outside the direct control of the program." in one of the links I provided earlier. So, are links a condition outside the direct control of the program? Are they external resources? I don't see that. The same component that implements open() also directly implements the storage of the link. So the fact that something is a link is entirely within the filesystem's domain. So, again, using exceptions does not seem justified here.

          Sanjay has argued that it is abnormal for a filesystem to link to a different filesystem. This misses the intended sense of "normal". The value of a link is normal data, under the control of the filesystem implementation, regardless of whether it points to a different filesystem. As I tried to elaborate above, "abnormal", with regard to exceptions, is defined as out of the direct control of the module, or in error. One could reasonably call directories "abnormal" in some sense (they're not files), but they are not errors, nor are they conditions out of the control of the application, nor are they issues that are best handled at some unknown spot higher on the stack. Directories, like links, are thus best modelled directly by the API, not as exceptions.

          The primary argument folks have provided for the superiority of exceptions is that they better preserve the current service provider interface, and that this interface more obviously describes the actions in question. It does describe the non-link part of these actions more concisely, but it hides the link part, so preserving that API is not a clear win.

          This proposed use of exceptions in fact seems to me a poster child for how not to use exceptions. If we go down this route, I don't see how we'll be able to argue against exceptions in other places. Non-local control is not required. No error is involved. External conditions are not involved. None of the standard reasons for using exceptions are met. Exceptions are used here purely as a convenient control flow mechanism. It's effectively setjmp/longjmp, whose use we should not encourage.

          Owen> we aren't changing the FileSystem API at all.

          We're not changing the public API much, but we are changing the service provider API substantially. We'd like most FileSystems to support links, and to do so they'll need to change what many methods return. It would be nice to be able to do this back-compatibly, without altering the existing API much. But I don't see a way to do that and still achieve our performance goals without setting a bad precedent for the appropriate use of exceptions.

          I'm not trying to be an uncompromising ass here. ("You don't have to try, it comes naturally", I hear you say.) If you look at the early history of this issue, you'll see that I only with reluctance and skepticism originally outlined the return type approach, because it is clearly a big, disruptive change that's hard to justify. But I really can see no other way that doesn't set precedents that we cannot live with as a project. I'd love for someone to provide a "hail mary" solution, that leaves the API's return types mostly alone, performs well and doesn't set a bad example, but until that happens I don't see an alternative. (Or for someone to convince me that this is actually a reasonable, appropriate, intended use of exceptions.) But until then, I think we just need to accept that FileSystem's SPI must get more complex in order to efficiently support links.

          Doug Cutting added a comment -

          To make my position clear: I am -1 on using exceptions to signal links in the FileSystem SPI. I would also recommend against using them in the Namenode, but, as I stated before, that's not a public API, and we can clean it up later. I believe we should not use exceptions for control flow there either, but I would not veto this patch for that.

          Doug Cutting added a comment -

          Konstantin> open() does not literally return a boolean, you mean that throwing an exception should be treated as an (boolean) indicator to whether the link should be retrieved, right?

          Yes, that's what I meant. We'd be using an exception to do something that could equivalently be done with a boolean.

          > But the methods have always been conditional in that sense, because they are throwing AccessControlException, QuotaExceededException, and FileNotFoundException. And XMountPointException is just another exception of the same nature.

          I think its nature is different. The former are all cases where the long-distance throwing is valuable. These are intended to be caught and processed at a higher level. In the successful execution of a typical application one would not expect to see these. An XMountPointException on the other hand would never be handled at some unknown higher level, it's always handled at a known point. It is expected to occur frequently in the course of a typical application's execution as part of the intended control path.

          Doug Cutting added a comment -

          I mistakenly attributed a comment above to Owen, when it was really Dhruba. Sorry! I should be more careful.

          Konstantin Shvachko added a comment -

          I understand your answer to my question HADOOP-4044#action_12638156 to be that this argument is about the programming style rather than performance or anything else. Or would you put it differently?

          Quoting one of the links you provided:
          "Checked exceptions : represent invalid conditions in areas outside the immediate control of the program"

          If a program cannot handle some condition, it throws a checked exception.
          So the question of whether the value of a link is normal data is actually program layer-dependent.

          • Can it be handled within the name-node? Yes for local links - handle them. No for external links - throw the exception.
          • Can it be handled by ClientProtocol? No. Throw the exception.
          • Can it be handled by DFSClient? No. Throw the exception.
          • Can it be handled by DistributedFileSystem? No. Throw the exception.
          • Can it be handled by FileSystem? Yes! Because FileSystem caches file system implementations.

          FileSystem is the first layer for which link values make sense.
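
          (For illustration, a sketch - not from the patch - of that first layer under this proposal's assumptions: openRaw() stands in for the per-filesystem call that may throw XMountPointException, and MAX_LINK_DEPTH is a made-up bound on link chains:)

          public FSDataInputStream open(Path p) throws IOException {
            FileSystem fs = this;
            for (int i = 0; i < MAX_LINK_DEPTH; i++) {     // bound link chains
              try {
                return fs.openRaw(p);                      // per-filesystem call
              } catch (XMountPointException e) {
                p = fs.getLinkTarget(p);                   // one extra RPC per link crossed
                fs = FileSystem.get(p.toUri(), getConf()); // may switch to another file system
              }
            }
            throw new IOException("Too many levels of symbolic links: " + p);
          }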

          Doug Cutting added a comment -

          Konstantin> this argument is about the programming style rather than performance or anything else

          It's clearly about both of those. We wish to make a change motivated by performance, and the question is how best to implement that change. We have two proposals on the table: one that is less invasive but uses a discouraged style, and one that involves substantial changes but avoids that style problem. I tried earlier to move the discussion from style guidelines to the underlying reason for those guidelines (that an API's purpose is to model expected control and data flow), but that argument seems to have caused more confusion than it has helped, so I've mostly reverted to citing style guidelines.

          Konstantin> FileSystem is the first layer for which link values make sense.

The fact that that layer is fully resolved and known indicates that links should be a first-class value in the API. All of the other layers you present are part of a single module (HDFS) whose job is to present itself as a FileSystem. If HDFS is implementing the FileSystem API, then FileSystem's datatypes are not alien to HDFS. This is the same twist that Sanjay pursued. Yes, the link URI is to a file outside of that particular HDFS filesystem, but link URIs are normal data that HDFS accepts as parameters (when adding a symbolic link), stores, and returns (when dereferencing a link). So this is not an alien object or an unexpected situation. A link to another filesystem is a sensible, first-class value within HDFS.

          Tsz Wo Nicholas Sze added a comment -

          Doug> Links are clearly not programming errors or client code errors.

It depends on how the API is defined. For example, we could define the API open(Path f) such that the parameter f is supposed to be a non-linked path. An exception would be thrown if client code uses it with a linked path.
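
As a sketch, that contract might read as follows (UnresolvedPathException is a hypothetical name; the comment here has no specific exception type):

  /**
   * @param f a fully-resolved, non-linked path
   * @throws UnresolvedPathException if f contains a symbolic link
   */
  public FSDataInputStream open(Path f) throws IOException;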

          Owen O'Malley added a comment -

          Getting back to the nuts and bolts of this patch, I don't really like the structure of this implementation. I think the LinkResolvers and the myriad anonymous classes that extend it are very hard to read. I'd much rather have something like:

          abstract class LinkResult<T> {
            boolean isLink();
            T getResult();
            FileSystem getFileSystem() throws IOException;
            /** Throws an exception if too deep. */
            Path getNewPath() throws IOException;
          }
          
          class DistributedFileSystem extends FileSystem {
            ...
            public FSDataInputStream open(Path p, int buffer) throws IOException {
               LinkResult<FSDataInputStream> result = dfs.open(p, buffer, verifyChecksum, statistics);
               if (result.isLink()) {
         return result.getFileSystem().open(result.getNewPath(), buffer);
               } else {
                 return result.getResult();
               }
            }
          }
          

          It is much easier to follow and understand and forces no changes on FileSystems that don't support links.

          Doug Cutting added a comment -

Owen, do you think this logic is unique to HDFS? If we add support for links to another FileSystem implementation would we want to share this logic with HDFS? If so, could we do it through a common base-class, like LinkableFileSystem? Then FileSystems that never intend to support links could continue to directly implement the end-user methods, while FileSystems that support links would extend the new subclass and implement the LinkResult<T> SPI methods. Is that consistent with your proposal?

          Also, the pseudo code above will cause a stack overflow for a circular link. Shouldn't we catch loops earlier than that? And if the loop-detection logic is non-trivial, won't we end up with either lots of duplicated code or something like the anonymous classes in Dhruba's patch?
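
One way to read that base-class suggestion, combining Owen's LinkResult<T> SPI with a shared resolution loop (LinkableFileSystem and openImpl are hypothetical names; a sketch, not a proposal):

  public abstract class LinkableFileSystem extends FileSystem {
    private static final int MAX_LINKS = 32; // assumed limit

    /** SPI: returns either the final stream or an unresolved link. */
    protected abstract LinkResult<FSDataInputStream> openImpl(Path p, int bufferSize)
        throws IOException;

    @Override
    public FSDataInputStream open(Path p, int bufferSize) throws IOException {
      for (int hops = 0; hops < MAX_LINKS; hops++) {
        LinkResult<FSDataInputStream> r = openImpl(p, bufferSize);
        if (!r.isLink()) {
          return r.getResult();
        }
        FileSystem next = r.getFileSystem();
        if (next != this) {
          // note: the hop count does not survive this cross-filesystem handoff,
          // which is the non-trivial part of loop detection raised here
          return next.open(r.getNewPath(), bufferSize);
        }
        p = r.getNewPath();
      }
      throw new IOException("Too many symbolic links: " + p);
    }
  }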

          Doug Cutting added a comment -

          > we could define the API open(Path f) such that the parameter f is supposed to be a non-linked path.

          Does calling something an "error" really make it an "error"? It seems like the only motivation to call this an error is to justify using exceptions to handle links, not that it would in fact be a user or program error to call it in this way. Rather, it would be an expected, documented, supported event, but we'd call it an error anyway, because otherwise we'd have to actually model the functionality in our API.

          Tsz Wo Nicholas Sze added a comment -

          > It seems like the only motivation to call this an error is to justify using exceptions to handle links, not that it would in fact be a user or program error to call it in this way.

There is nothing wrong with assuming the parameter f is a non-linked path. For performance reasons, clients are responsible for resolving links before use, since resolving links is expensive, especially for cross-file-system links.

          Suppose we have the following link
          hdfs://clusterA/project/xxx -> hdfs://clusterB/yyy

If a client submits a job with hdfs://clusterA/project/xxx/input and hdfs://clusterA/project/xxx/output, all input and output file accesses involve two clusters and require two calls. However, if the client uses hdfs://clusterB/yyy directly, only one cluster is involved.

          Doug Cutting added a comment -

          > clients are responsible to resolve links before using it

          That might be a reasonable contract, but correct use of the API would then involve extra RPCs to check links before trying to open them, no? But if it is expected that folks will regularly pass unresolved links, then we shouldn't use an exception, since callers of this method would be expected to immediately handle the case of unresolved links. There's no need for a throw, as there might be for, e.g., FileNotFound, which must be handled at some unknown layer above. Exceptions are a long-distance goto, a control mechanism that should be reserved for cases that require them, where the catcher is unknown.

          Owen O'Malley added a comment -

          Ok, I realized after I posted that in fact the LinkResult is not abstract.

          Doug> do you think this logic is unique to HDFS?

          Well, currently yes, but in the long term hopefully not. In my proposal, most of the work would happen in LinkResult.

          Doug> Also, the pseudo code above will cause a stack overflow for a circular link.

          I was assuming (and included in the comments) that getNewPath would throw if it was too deeply nested. I would probably use a thread local variable to measure the depth of the traversal. It isn't great, but it would be encapsulated in LinkResult.
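
A rough sketch of that thread-local bookkeeping (all names here are hypothetical):

  public abstract class LinkResult<T> {
    private static final int MAX_DEPTH = 32; // assumed limit
    private static final ThreadLocal<Integer> DEPTH = new ThreadLocal<Integer>() {
      @Override protected Integer initialValue() { return 0; }
    };

    /** Throws an exception if too deep. */
    public Path getNewPath() throws IOException {
      int depth = DEPTH.get() + 1;
      if (depth > MAX_DEPTH) {
        DEPTH.set(0); // reset so the next top-level call starts fresh
        throw new IOException("Too many levels of symbolic links");
      }
      // assumes the entire resolution of a path happens on one thread;
      // the counter must also be reset once resolution succeeds
      DEPTH.set(depth);
      return resolveOneLevel();
    }

    protected abstract Path resolveOneLevel() throws IOException;
  }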

          Sanjay Radia added a comment -

Owen, my first instinct was to implement it the way you suggested. However, once you add loop detection you may want to consider using a structure similar to the one that Dhruba used, to share that code. The addition of a FOOImpl for each method does clutter the FileSystem code.
Doug's suggestion of LinkableFileSystem is interesting. No, I don't think this is unique to HDFS; KFS may want to support this too.
BTW the local file system will want to support the normal dot-relative symlinks but perhaps not the remote ones (i.e., the mounts). So Linkable is not quite the distinction (but that is just a question of naming).

          Doug Cutting added a comment -

          > the local file system will want to support the normal dot-relative symlinks but perhaps not the remote one (ie the mounts ).

Internal symlinks are supported automatically by the local FS, at least on unix. And support for remote symlinks on the local FS might be very cool. Unix will gladly let me create symbolic links to an HDFS URI (try 'ln -s hdfs://foo/bar foo; ls -l foo'). These cannot be resolved by a unix shell, but they could be by 'bin/hadoop fs -ls', by MapReduce programs, etc. One could, e.g., keep in one's home directory on unix links to the HDFS directories that one commonly uses.

          Doug Cutting added a comment -

          Owen> I would probably use a thread local variable to measure the depth of the traversal. It isn't great, but it would be encapsulated in LinkResult.

          I see. Yuck. It wouldn't in fact be entirely encapsulated, since it would assume that all the resolution of a path happens in the same thread. Thus we'd want to mark each of the FileSystem methods that uses a LinkResult final, add some comments noting that the LinkResult API has this restriction, etc. (It'd be odd to ever try to run these in different threads, but folks have tried odder things!) In general, thread locals seem like a poor design choice for this. They're best for things like caches, to avoid thread contention, rather than for storing program state, no?

          We could instead perhaps add something like a LinkResolution object that keeps a counter internally that is incremented as the link is resolved. The SPI would pass this around rather than a Path? Is that better than Dhruba's patch?

          Doug Cutting added a comment -

Perhaps we're prematurely optimizing here. Perhaps the extra RPCs to resolve paths won't hurt things that much. The namenode benchmarks I've seen show that it handles 50k opens/second. How much slower would that be if we always first resolved the path? In this naive approach we'd add a method which must always be called before a file is opened (or renamed, etc.) that would return an 'isResolved' boolean and a path. If the path is resolved then it can be used, otherwise we must resolve it further. The resolution loop could then be independent of the FileSystem method, so no ThreadLocal or other mechanism would be required to maintain this state.

          public class ResolvedPath extends Path {}
          
          public class LinkResolution {
            public boolean isResolved() { ... }
            public Path getUnresolvedPath();
            public ResolvedPath getResolvedPath();
          }
          
          public class FileSystem {
            ...
            public abstract LinkResolution resolveLinks(Path p);
            public abstract FSInputStream open(ResolvedPath p);
          
            private ResolvedPath getResolvedPath(Path p) throws IOException {
              LinkResolution resolution = resolveLinks(p);
              int count = 0;
              // follow links, giving up after MAX_LINKS hops
              while (!resolution.isResolved() && count++ < MAX_LINKS) {
                p = resolution.getUnresolvedPath();
                resolution = p.getFileSystem(conf).resolveLinks(p);
              }
              return resolution.getResolvedPath();
            }
          
            public FSInputStream open(Path p, ...) throws IOException {
              return open(getResolvedPath(p), ...);
            }
            ...
          }
          

          No exceptions, no thread locals, no anonymous classes, etc. Should we benchmark this before we rule it out?

          Doug Cutting added a comment -

          Oops. That last method should instead read:

          public FSInputStream open(Path p, ...) throws IOException {
            ResolvedPath resolved = getResolvedPath(p);
            return resolved.getFileSystem(conf).open(resolved);
          }
          
          Show
          Doug Cutting added a comment - Oops. That last method should instead read: public FSInputStream open(Path p, ...) throws IOException { ResolvedPath resolved = getResolvedPath(p); return p.getFileSystem(conf).open(p); }
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > That last method should instead read ...

          I guess we need an openImpl(...) method. Otherwise, it would be an infinite loop.

          Konstantin Shvachko added a comment -

          I am replying to this Doug's comment HADOOP-4044#action_12638405.
I was thinking about it all night and have to admit I am not quite sure how to understand it.

          We were talking about the mount point exception handling on different layers of the implementation of HDFS.
It looks like, based on the references you provide, checked exceptions can be used when a particular layer does not have an immediate response to the condition causing the exception.

          If your comment means that you can tolerate mount point exception on all levels of HDFS implementation up to (excluding) the FileSystem api, then I anticipate a big sigh of relief from a large group of developers. And we can turn to discussion of the FileSystem api return types, which Owen is already successfully conducting.

Otherwise, my understanding of the comment is that you are introducing a new argument (line of defense) for your dual-return-type approach, and that means we are back at the beginning and will have to start the whole discussion over again.
          Please clarify.

          Doug Cutting added a comment -

          Konstantin> checked exceptions can be used when a particular layer does not have an immediate response to the condition causing the exception

          First, in this case, I consider HDFS to be a single layer, or module. HDFS controls all of its components. The calls on the stack above that are still part of HDFS are known to HDFS, are under HDFS's control and are modified in a coordinated manner, as a unit. In particular, all of HDFS knows that a link can be to an external FileSystem, since this is a generic property of all FileSystems. This is a safe assumption for any code in HDFS to make, and does not violate any abstraction boundaries within HDFS.

          An exception is justified when the best handler for it is unknown. It is not the immediacy of the handler, or the proximity of the handler, but whether the handler is fully determined. If the handler is known, then exceptions are just being used as a convenient message passing channel. When the destination is known, methods, parameters and return values are the preferred message passing mechanisms in object-oriented programming.

          Does that address your question?

          Doug Cutting added a comment -

          > I guess we need an openImpl(...) method. Otherwise, it would be an infinite loop.

No, they have different signatures. I've attached a patch for FileSystem that implements this. For FileSystems that do not support links this is completely back-compatible. HDFS could implement this by adding an RPC per file before it is opened. If that proves too slow, it could perhaps optimize things by returning and caching the block list with the link information, so that a second RPC could be avoided.

          The patch is not complete, but it works. I've only implemented open. It's meant to be an illustration of a style that we might use.
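
Concretely, the two signatures avoid mutual recursion roughly like this (a sketch reusing the ResolvedPath marker type from the pseudo code above):

  // public API: accepts any path and resolves links first
  public FSInputStream open(Path p) throws IOException {
    ResolvedPath resolved = getResolvedPath(p);
    return resolved.getFileSystem(conf).open(resolved); // dispatches to the overload below
  }

  // SPI: accepts only fully-resolved paths, so it never re-enters resolution
  public abstract FSInputStream open(ResolvedPath p) throws IOException;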

          Konstantin Shvachko added a comment -

          Doug> Does that address your question?

Yes it does, in the sense that I see you are changing the subject.
It was about exception usage; now it is about monolithic layers of HDFS.
Since programming-style correctness is not measurable, as opposed to performance, there is no definite answer on what is correct or incorrect, and therefore the discussion can clearly go on forever. If one argument is exhausted, another can be raised or linked to the previous, and so on.

          Although the question you raise is interesting by itself.
HDFS is not a single monolithic layer, just as Hadoop is not a single layer. MapReduce is a layer on top of HDFS, but we do not impose the knowledge of splits upon the file system. Going further up the stack, we also do not require HDFS to recognize HBase records or files corresponding to columns in a table store.

HDFS layers are clearly different abstractions. E.g., the NameNode does not know about input and output streams. Sure, the layers have common structures; FileStatus, say, is visible to all layers, but ContentSummary is constructed at the top FileSystem level.

          I don't think this argument works in favor of your approach and justifies passing link values through the whole hierarchy of HDFS layers.

          Doug Cutting added a comment -

          > I see you are changing the subject.

I am saddened that you think that. I have tried to say the same thing in many different ways, to make myself better understood. I will say it one last time: I will veto any patch that uses exceptions to communicate links within the FileSystem implementation. Exceptions are an inappropriate mechanism for communicating normal, expected values. In the namesystem sub-component of HDFS, which stores links, and any other sub-components between that and the generic FileSystem API, links are normal, expected events, even links to different FileSystems. There. I've said it again. I am now done saying it. Do not expect me to respond again about the issue of using exceptions to communicate the presence of links in HDFS. My answer will not change. Let's move on.

          Sanjay Radia added a comment -

Doug says: > Internal symlinks are supported automatically by the local FS, at least on unix. And support for remote symlinks on the local FS might be very cool. Unix will gladly let me create symbolic links to an HDFS URI (try 'ln -s hdfs://foo/bar foo; ls -l foo'). These cannot be resolved by a unix shell, but they could be by 'bin/hadoop fs -ls', by MapReduce programs, etc. One could, e.g., keep in one's home directory on unix links to the HDFS directories that one commonly uses.

It would be very cool to support them. (I guess unix thinks the : is simply part of the name.)
BTW if someone happens to create a dir called "hdfs:" with a valid path foo/bar under it, then the local file system will follow it and not return an error. Is this a concern?

When trying to go through the remote mount symlink, the local file system reports that the file does not exist. Hence, on a "file does not exist" error, the client-side lib will have to check if there are symlinks in the path. The only snag is that "file not found" errors will take a little longer to report as the
library tries to check if there is a symlink in the path. Is this an issue? Most apps consider "file not found" to be an error, and hence the speed of reporting it is not a concern; however, there are some apps that create files if they are missing, and their performance will suffer a little.

          Allen Wittenauer added a comment -

          >(I guess unix thinks the : is simply part of the name.)

Pretty much. ln doesn't validate the link at all, so you can put all sorts of interesting things there. It is up to the calling program or filesystem to verify the validity in its context. [For example, I believe it is rsh that, if you link a hostname to rsh, invoking the link will rsh to that host.]

It should also be noted that : is a valid directory separator under HFS+ (Mac OS X) as well. As a result, you're likely to see very bizarro results on operating systems that use : as a metacharacter.

          Doug Cutting added a comment -

          Sanjay> if someone happens to create a dir called "hdfs:" and valid path foo/bar under it then the local file system will follow it and not return an error. Is this a concern?

Folks would have to use a double-slash in the symbolic link (which unix ignores) for Hadoop to interpret it as an external link. So folks would have to use a single slash when linking to a directory whose name ends in a colon to keep behavior consistent. That doesn't seem like a big imposition.

Sanjay> Hence, on a "file does not exist" error, the client-side lib will have to check if there are symlinks in the path.

Good point. If we leverage Unix's automatic link following then we'd have to handle external links by catching FileNotFoundException. We could instead loop calling lstat(), following links that are internal, and returning when either a link to a non-file:// URI or an actual file is found. This would require native code, but this is a unix-only feature and we already provide native code for linux, so maybe that's acceptable. Or we could call out to a shell, but that would probably be too slow. Note that Java 7 adds API support for symbolic links, which will permit such looping logic, so that will provide a good way to implement this.

          FileNotFoundException doesn't actually make this any easier, since it doesn't contain the path we need, as far as I can tell. We'd still have to loop in native code or a shell script to find the external link, no? So maybe the native code route is best? We could implement just this bit of the Java 7 API, so when that arrives we can use it without changing much?
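
For illustration, with the Java 7 file API such a loop could look roughly like this (a sketch; the test for an external link here is a simplification):

  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Path;

  /** Follow local symlinks until we reach a real file or a non-local target. */
  static String resolveLocal(Path p) throws IOException {
    for (int hops = 0; hops < 32; hops++) { // assumed hop limit
      if (!Files.isSymbolicLink(p)) {
        return p.toString(); // an actual file or directory
      }
      Path target = Files.readSymbolicLink(p);
      if (target.toString().contains("://")) { // e.g. hdfs://host/path
        return target.toString(); // hand off to the matching FileSystem
      }
      // relative link targets are resolved against the link's parent
      p = p.getParent() == null ? target : p.getParent().resolve(target);
    }
    throw new IOException("Too many levels of symbolic links: " + p);
  }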

          Tsz Wo Nicholas Sze added a comment -

          Sanjay> if someone happens to create a dir called "hdfs:" and ...

Currently, ":" is probably not supported in paths. See HADOOP-2066.

          Sanjay Radia added a comment -

          Owen> I would probably use a thread local variable to measure the depth of the traversal. It isn't great, but it would be encapsulated in LinkResult.

          Doug> I see. Yuck.

          I second the Yuck.

          Doug> Perhaps we're prematurely optimizing here. Perhaps the extra RPCs to resolve paths won't hurt things that much.

The above was in reference to Dhruba's approach of using the FSLinkResolver abstract class versus Owen's more inline approach.
While Dhruba's FSLinkResolver is a little more complex, it is not complex enough to warrant doing the additional RPC.
We might be able to further simplify it by using a LinkableFileSystem base class as Doug suggested.

I really don't like the extra RPC for any method that has a pathname as a parameter, regardless of whether or not a link is in the path.

          Doug Cutting added a comment -

          Sanjay> I really don't like the extra rpc [ ... ]

What did you think about my suggestion above that we might use a cache to avoid this? First, we implement the naive approach, benchmark it, and, if it's too slow, optimize it with a pre-fetch cache of block locations.

          Raghu Angadi added a comment -

>First, we implement the naive approach, benchmark it, and, if it's too slow, optimize it with a pre-fetch cache of block locations.

          I am almost certain that it won't affect any benchmark other than NNBench.

          Raghu Angadi added a comment -

One of the issues with the 'resolve then open' approach is correctness. What happens if a link or directory is changed between these two operations? open() fails though it should not.

          Doug Cutting added a comment -

          > I am almost certain that it won't affect any benchmark other than NNBench.

          If that's really the case, what are we worried about here?

          > What happens if a link or directory is changed between these two operations? open() fails though it should not.

          The same thing that happens if that change is made just after a file is opened. If you open a file, then someone else deletes it, subsequent accesses to that file will fail. The namenode doesn't keep any state for files open for read, so a short-lived cache of block locations doesn't change things fundamentally.

That said, the cache idea only works for open, and doesn't work for rename, delete, etc. In these cases we don't want to pre-fetch a list of block locations. So never mind the cache idea anyway.

          The current options on the table seem to be:

          • Dhruba's patch modified to use Nicholas's idea of LinkResult<T> style, to avoid defining new return type classes for the SPI methods.
          • A less-invasive approach that requires two RPCs. We may later optimize this by converting FileSystem's API to use the above style, but we may not need to. We do need to be careful not to incompatibly change FileSystem's public API, but the SPI's not so constrained, since all FileSystem implementations are currently in trunk and can be easily maintained in a coordinated manner. In the meantime, we can start using symbolic links in archives, etc. while we work out if and how to better optimize them.

          Does that sound right?

          I don't have a strong preference. If I were implementing it myself I'd probably go for the simple approach first, early in a release cycle, then benchmark things and optimize it subsequently if needed. The risk is not that great, since we already have good ideas of how to optimize it. But the optimization will clearly help scalability, so it wouldn't hurt to have it from the outset either.

          FYI, I tried implementing my patch above as a LinkedFileSystem subclass, to better contain the changes. This turned out to be messy, since a LinkedFileSystem can link to an unlinked FileSystem. With the subclass approach this must be explicitly handled by casts and 'instanceof', while when FileSystem itself supports links this can be handled by default method implementations. So I am not convinced that a LinkedFileSystem subclass is a net win.

          Raghu Angadi added a comment -

          > The same thing that happens if that change is made just after a file is opened. If you open a file, then someone else deletes it, subsequent accesses to that file will fail.

yes, that's an HDFS implementation problem which could be fixed. That is qualitatively different from similar issues 'forced' by the FS layer/API (which could also be fixed in the future).

I don't have a strong preference either way regarding the implementation.

With the extra RPC, some of the negatives I see (nothing too urgent):

• we won't see many negative effects since there are very few loads that are gated by the NameNode currently. That might change.
• Every couple of months we might need to answer a "why is this so..?" question on the list.
• In my experience any extra work that could be avoided will end up being a problem in the future, especially in large distributed systems. So sooner or later we will end up 'correcting' it.
          dhruba borthakur added a comment -

I would like to avoid a design that incurs the overhead of an additional RPC every time a link is traversed. I will post a new patch that incorporates Nicholas's idea of FsResult<T> style return type classes for SPI methods. I am hoping that this will be acceptable to all.

          Konstantin Shvachko added a comment -

          > I would like to avoid a design that incurs an overhead of an additional RPC everytime a link is traversed.

          +1. This will affect not only NNBench but all benchmarks including DFSIO and especially NNThroughputBenchmark.
          GridMix and Sort will probably be less affected, but will suffer too.

          Sanjay Radia added a comment -

          > I would like to avoid a design that incurs an overhead of an additional RPC everytime a link is traversed.

          >+1. This will affect not only NNBench but all benchmarks including DFSIO and especially NNThroughputBenchmark.
          >GridMix and Sort will probably be less affected, but will suffer too.

+1. I would also like to avoid an extra RPC, since avoiding one is straightforward.

Doug> What did you think about my suggestion above that we might use a cache to avoid this? First, we implement the naive approach, benchmark it, and, if it's too slow, optimize it with a pre-fetch cache of block locations.

          Clearly your cache solution deals with the extra RPC issue.
          Generally I see a cache as a way of improving the performance of an ordinarily good design or algorithm. I don't like the use of caches as part of a design to make an algorithm work when alternate good designs are available that don't need a cache. Would we have come up with this design if we didn't have such an emotionally charged discussion on exceptions?

          We have a good design where if the resolution fails due to a symlink, we return this information to the caller. It does not require the use of a cache.
          We are divided over how to return this information - use the return status or use an exception.
          The cache solution is a way to avoid making the painfully emotionally charged decision for the Hadoop community.
I don't want to explain the reason we use the cache to Hadoop developers again and again down the road.
          We should not avoid the decision, but make it.
          A couple of weeks ago I was confident that a compromise vote would pass. I am hoping that the same is true now.

          Sanjay Radia added a comment -

          Doug says> Sanjay has argued that it is abnormal for a filesystem to link to a different filesystem. This misses the intended sense of "normal". The value of a link is normal data, under the control of the filesystem implementation, regardless of whether it points to a different filesystem.

          Don't want to restart the debate, but need to correct something that was attributed to me:
I never said that symlinks to remote files were not normal. Processing them (i.e., following them to a different NN) is not normal for a NN. Some NNs may follow them and some may not. Indeed, DNS does both (follows or redirects the caller).

          dhruba borthakur added a comment -

          The Client Protocol uses types of the form FsResult<BooleanWritable>, FsResult<LocatedBlocks>, etc. This patch is ready for review.
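
For illustration, the protocol signatures then look roughly like the following (abbreviated; see the patch for the exact methods):

  public interface ClientProtocol {
    FsResult<LocatedBlocks> getBlockLocations(String src, long offset, long length)
        throws IOException;
    FsResult<BooleanWritable> setReplication(String src, short replication)
        throws IOException;
    // each call carries either its normal result or the unresolved link value
  }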

          Doug Cutting added a comment -

          The src/core parts of this patch look good to me.

          One nit: since maxPathLinks is not settable, it should be a private constant MAX_PATH_LINKS for now. And 1000 seems way too big. Linux limited links to 5 for a long time, but Posix specifies a minimum of 8. So we should probably choose somewhere between 8 and 32.

          If folks find this to be a problem, we could make the limit configurable, or we could simply detect loops:

            T resolve(FileSystem fs, Path p) throws IOException {
              T in = next(fs, p);
              Path first = p;
              HashSet<Path> seen = new HashSet<Path>();

              // Continue looping until there are no more symbolic links in the path.
              while (in.isLink()) {
                seen.add(p);

                // Construct the new path from the link target.
                p = in.getLink();

                if (seen.contains(p)) {
                  throw new IOException("Loop in symbolic links " + first);
                }

                in = next(fs, p);  // resolve the next hop, or we loop forever
              }
              ...
            }
          
          Tsz Wo Nicholas Sze added a comment -

          FsResult can be divided into two classes since there are two kinds of results: Writables (e.g. FsResult<BooleanWritable>) and non-Writables (e.g. FSDataOutputStreamLink). I suggest the following layout:

          public class FsResult<T> implements FSLinkable {
            ...
          }
          
          public class FsResultWritable<T extends Writable> extends FsResult<T> implements Writable {
            ...
          }
          

          Methods like appendImpl(...) should return FsResult<FSDataOutputStream> and methods like setReplication(...) should return FsResultWritable<BooleanWritable>.

          Then, we don't need classes like FSDataInputStreamLink and FSDataOutputStreamLink. We also probably don't need FSLinkable anymore.

          4044_20081030spi.java shows the complete source code for FsResult and FsResultWritable. I have not removed FSLinkable yet.

          Sanjay Radia added a comment -

          Went through the patch. Looks good.
          Feedback:
          1) Why does createSymLink set the group name to "none" rather than pass a null parameter to use the group name from the parent dir, as the create and mkdir operations do?

          2) rename and createSymlink both take two path parameters - either of the paths can traverse a symlink.
          createSymlink does not handle this while rename does.
          An alternate way to handle this (different from how rename handles it) is to have the FsResult return a flag saying which of the two paths crosses a symlink.

          3) getFileInode throws UnresolvedPathException. However, LeaseManager.findPath() was changed to throw IOException (it threw no exceptions before). findPath should be declared to throw only UnresolvedPathException.

          4) There is a test to check that symlinks are correctly persisted in the fsimage/edits logs.
          Please add a TestSymlinks that tests symlink handling for each of the FileSystem methods that take one or more path names. For those that take multiple path names, the test should check for symlinks in either parameter.
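          For illustration, here is a sketch of the kind of test meant in (4), using rename as the example (the createSymlink signature and the cluster/createFile helpers are assumptions, not necessarily the patch's exact API):

            // Sketch: rename with a symlink in the destination parameter,
            // then in the source parameter.
            public void testRenameAcrossSymlink() throws IOException {
              FileSystem fs = cluster.getFileSystem();
              Path dir  = new Path("/test/dir");
              Path link = new Path("/test/link");
              fs.mkdirs(dir);
              fs.createSymlink(dir, link);                 // link -> dir

              Path file = new Path("/test/file");
              createFile(fs, file);
              assertTrue(fs.rename(file, new Path(link, "file")));  // dst via link
              assertTrue(fs.exists(new Path(dir, "file")));

              assertTrue(fs.rename(new Path(link, "file"), file));  // src via link
              assertTrue(fs.exists(file));
            }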

          Robert Chansler added a comment -

          Just a list of important test cases saved from a hallway conversation:

          • Shell commands
          • Trash
          • Archives
          • distcp
          • chmod, chown, chgrp APIs
          • Rename
          dhruba borthakur added a comment -

          Thanks Sanjay and Robert. Will work in your feedback and post a new patch.

          dhruba borthakur added a comment -

          Incorporated Sanjay's review comments. The FileSystem.createSymLink() does take two parameters, but only the first one needs to be "resolved". The second one can be anything (just a blob) and does not need to be resolved. In fact, the second parameter could be a dangling non-existent path.

          Included test cases for all FileSystem APIs that were modified.

          From Rob's list of tests:

          • Shell commands – unit test included
          • Trash – unit test included (because Trash is a pure FileSystem client)
          • Archives – current implementation does not follow symbolic links.
          • distcp – pure FileSystem client, does not follow symbolic links
          • chmod, chown, chgrp APIs – unit test included
          • Rename - unit test included
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12393857/symLink12.patch
          against trunk revision 713847.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 62 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 3 new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3588/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3588/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3588/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3588/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          Investigating the failed unit test and the findbugs problems.

          dhruba borthakur added a comment -

          Fixed findbugs warnings. Also fixed the reason for the unit test failure: FileSystem.exists() was not implemented correctly in the earlier version of this patch.
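          For context, exists() is typically a thin wrapper over getFileStatus(), so link resolution has to happen in the latter; roughly like this (not the patch's exact code):

            // Typical shape: exists() delegates to getFileStatus(), which
            // resolves any symlinks in the path; a dangling link reports false.
            public boolean exists(Path f) throws IOException {
              try {
                getFileStatus(f);
                return true;
              } catch (FileNotFoundException e) {
                return false;
              }
            }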

          Tsz Wo Nicholas Sze added a comment -

          Talked to Dhruba offline. He mentioned that the idea in my previous comment is not easy to implement. I did not have much time to take a closer look, so I suggest we use Dhruba's implementation and do the improvement later.

          Konstantin Shvachko added a comment -
          1. We should have some performance testing. There is a lot of wrapping/unwrapping going on. So we need to make sure that old operations like getBlockLocations(), create(), list() etc. on the FileSystem level perform the same as the old code.
          2. > I suggest we use Dhruba's implementation and do the improvement later.
            I don't understand what the problem is here.
            Nicholas proposes to have 2 classes, FsResult<T> and FsResultWritable<T>, which extends FsResult<T>.
            This is exactly what should be done here. ClientProtocol methods should return FsResultWritable<T>, while DFSClient and all other *FileSystem classes should return FsResult<T>. This is because Writable is needed only at the RPC level; after that, Writable is not required. E.g.
            interface ClientProtocol {
              FsResultWritable<LocatedBlocks>  setReplication(...);
              FsResultWritable<NullWritable> create(...);
            }
            
            class DFSClient {
              FsResult<LocatedBlocks>  setReplication(...) {
                return namenode.setReplication(...); // can do this because FsResultWritable is a subclass of FsResult
              }
            
              FsResult<OutputStream> create(...) {
                return new FsResult<OutputStream>(new DFSOutputStream(...));
              }
            }
            
          3. I did not review the code. Just looked in few places.
          • Looks like FsResult constructors do not need to throw any exceptions.
          • DFSClient.getHints() is deprecated. It is a good time to remove it.
          dhruba borthakur added a comment -

          Incorporated Konstantin's comments.

          dhruba borthakur added a comment -

          1. Performance testing: I have run DFSIOTest on 10 nodes only. I did not see any performance degradation for any operation, including creates, deletes, reads or writes. I can re-run the test and post the results here.

          2. The problem is that FileSystem.FSLinkResolver and FileSystem.resolve() will become more complicated. Also, every operation in DFSClient would have to convert an FsResultWritable object into an FsResult object.

          3. Fixed issues with FsResult constructor as you suggested.

          4. I would like to remove the deprecated method DFSClient.getHints as part of a separate JIRA.

          Konstantin Shvachko added a comment -

          > 1. can re-run the test and post them here.
          Yes, that's what I meant: the results should be posted.

          > 2. every operation in DFSClient will have to convert an FsResultWritable object into an FsResult
          No, they don't, because FsResultWritable is a subclass of FsResult. So the RPC call returns FsResultWritable, and DFSClient can return it further up as FsResult; no conversion needed.
          Please look at the implementation of DFSClient.setReplication() above.

          > 4. Sure, you can remove the method in a separate jira prior to this one.

          Nigel Daley added a comment -

          Dhruba, I'm having trouble enumerating all the test cases that are covered by the unit tests. Can you write a test plan that outlines the test cases? More than that, I'm having trouble understanding the design of this 200Kb patch from the code and the last 2.5 months of comments on this Jira. Is there a design document for this?

          From first glance, I don't see test cases for these:

          • deleting a symlink that points to a file
          • deleting a symlink that points to a directory
          • symlink to a symlink
          • any test cases with relative links
          • delete the linked-to file, but not the link, then access the link
          • delete the linked-to file, recreate the same file, then access the link
          • links to other filesystems
          • archives containing links
          • rename a file to an existing symlink
          • rename a symlink to an existing file
          • setTimes and setReplication cases should check the result on both the symlink and linked to file
          • tests involving lack of permissions to create symlinks
          • tests that try to createSymlinks with filesystems that don't support them
          • ...

          I'm also left with a number of questions:

          1. Do setReplication, setTimes, chmod, chgrp and chown apply to the symlink or to the underlying file/dir that the symlink points to?
          2. Does fsck report symlinks?
          3. Does har:// protocol support symlinks in the path leading to har file? In har file itself?
          4. Why doesn't the main method of this patch, FileSystem.createSymlink(), which is declared to throw IOException, have any @throws clause in its javadoc?
          5. Under what conditions does FileSystem.createSymlink() throw IOException?
          ...and many more.

          Sanjay Radia added a comment -

          Dhruba wrote:

          • Archives - current implementation does not follow symbolic links.
          • distcp - pure FileSystem client, does not follow symbolic links

          What is the desirable default behavior when a user does a distcp or an archive of a subtree that has symlinks to other
          volumes?
          To avoid accidental copying of large amounts of data, I suspect that the default should be to copy the symlink as a symlink
          rather than follow the symlink and copy its target.
          If this is indeed the desired default behavior then the patch will need to fix that.

          On a related note, symlinks to an archive will throw an exception/error when one tries to follow it. That would require a fix to the archive file system; we should file a separate jira to fix that in the future.

          Edward Capriolo added a comment -

          I am using/helping the hadoop-hive subproject. I wanted to share a use case for symlinks.

          For example, suppose a directory inside hadoop, /user/edward/weblogs/{web1.log,web2.log,web3.log}. I can use a Hive EXTERNAL table to point to the parent directory, and can then use Hive to query this external table. This is very powerful. This will work unless a file with a different format, e.g. web_logsummary.csv, is also in the directory (which is my case).

          Being able to drop in a 'symlink' where a file would go could be used to create structures from already existing data. Imagine a user with a large hadoop deployment who wishes to migrate to or start using hive. An external table is constrained to one directory, so they would need to recode application paths and/or move files. With a 'symlink' concept, anyone could start using hive without re-organizing or copying data.

          Right now, hive has a lot of facilities to deal with input formats, such as specifying delimiters etc, but forcing the data either into a warehouse or into an external table is limiting. 'Symlinks' tied together with hive's current input format capabilities would make hive more versatile.

          dhruba borthakur added a comment -

          Patch merged with latest trunk. There are still some unit test failures.

          dhruba borthakur added a comment -

          The unit test TestLink was failing because dfs.support.append was not set. Fixed.
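          For anyone who hits the same failure, the fix amounts to enabling append in the test's configuration before bringing up the mini-cluster, along these lines:

            // Enable append support for tests that reopen files for writing.
            Configuration conf = new Configuration();
            conf.setBoolean("dfs.support.append", true);
            MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);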

          Yajun Dong added a comment -

          Hi, how is this feature progressing, and when do you expect the patch to be complete?

          dhruba borthakur added a comment -

          Hi Yajun, this patch might not be part of 0.21 because I am short on resources to write the test plan, negative unit tests, performance tests at scale, etc. If somebody can volunteer to write the test plan, user guide and a few negative unit tests, that would be a great help (HDFS-254, HDFS-255).

          Yajun Dong added a comment -

          @dhruba borthakur, Thanks for your comment.

          Can I ask whether this feature has been completed? I suppose the only things left are the test plan, negative unit tests and performance tests, right?

          Which patch above should I use to test it? And, by the way, which version do you think this patch will be merged into?

          weimin zhu added a comment -

          This is a copy of symLink14.patch for the Hadoop 0.20.0 release.

          usage:
          put symlink-0.20.0.patch into $HADOOP_HOME

          cd $HADOOP_HOME
          patch -p0 < symlink-0.20.0.patch
          ant jar
          mv hadoop-0.20.0-core.jar hadoop-0.20.0-core.jar.bak
          cp build/hadoop-0.20.1-dev-core.jar .

          finally, restart the hadoop system

          Eli Collins added a comment -

          Discussed with Dhruba. I'll start putting together a design doc to cover the above questions/comments which deserve some more discussion. Will also start on a test plan and user guide.

          I rebased the latest diffs against trunk and have patches for hdfs, common and mapreduce. I'll file jiras for the last two and update each with the relevant patch.

          eric baldeschwieler added a comment -

          Hi Folks,

          Expect more discussion here! I'll let others express their opinions, but just FYI, there are unresolved design issues here that folks on our HDFS team feel strongly about...

          E14

          Allen Wittenauer added a comment -

          Can we get a summary of what those issues are, rather than poring through all of the comments?

          Owen O'Malley added a comment -

          I'd like to bring up again the issue of throwing exceptions between the FileSystem implementations to notify the framework that there is a link to an external resource. That seems a lot cleaner than making every single method return a variant type that is sometimes the file handle (or other outcome of the method) and sometimes the address of the new path to look in. If the SPI always returned link info, I would lean the other way. But the fact that the SPI does dereference the link, except in the case of an external link, leads me to like the use of an exception as the mechanism to signal the condition.
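          To make the comparison concrete, the exception-based client side would look roughly like this (the exception and accessor names are placeholders, not a committed API):

            // Rough shape of the exception-driven approach: retry the
            // operation against the path the link points to.
            FSDataInputStream open(Path p) throws IOException {
              for (int hops = 0; hops < MAX_LINK_DEPTH; hops++) {
                try {
                  return getFileSystem(p).open(p);
                } catch (UnresolvedLinkException e) {
                  p = e.getResolvedPath();   // follow the link and try again
                }
              }
              throw new IOException("Too many levels of symbolic links: " + p);
            }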

          It is a shame that this isn't looking like it will make 0.21's cutoff.

          Doug Cutting added a comment -

          The most succinct test I've come up with for when it's appropriate to use exceptions is whether the target is known. If you know exactly what class and where on the stack an exception will be caught, then it's just a glorified goto statement working around poor API modeling. If you genuinely have no idea how a situation should be handled, then an exception is appropriate.

          In this case the exception will always be caught by FileSystem/FileContext and always be handled in precisely the same way by the same boilerplate code. There's nothing ambiguous about this situation that justifies exceptions.

          That's the negative case. The positive case is that we are adding the notion of symlinks to our FileSystem API, including cross-FileSystem symlinks. Thus, a FileSystem becomes not just fundamentally a mapping of path->bytes, but a mapping of path->[bytes|path]. This is a fundamental change to our FileSystem metaphor, and, as such, it requires a fundamental change to its SPI. The SPI can and should model this functionality directly with parameters and return values, not through values passed in exceptions.

          > It is a shame that this isn't looking like it will make 0.21's cutoff.

          I agree that it would be shameful if this does not make it into 0.21. The patch has been here for 9 months!

          Philip Zeyliger added a comment -

          In my opinion (I'd say IME, but I've never sneaked one of these past code review), using exceptions for flow control is shady. Joshua Bloch would say the same ("Don't force clients to use exceptions for control flow"; see page 38, http://74.125.155.132/search?q=cache:MmIXuflmcvcJ:lcsd05.cs.tamu.edu/slides/keynote.pdf+joshua+bloch+exceptions&cd=4&hl=en&ct=clnk&gl=us&client=firefox-a or http://lcsd05.cs.tamu.edu/slides/keynote.pdf ).

          If the worry is about boilerplate, it seems like this exception would have to be checked — otherwise, people would not know that this exception is part of what they have to handle and implement — and nothing introduces misunderstood boilerplate quite as fast as checked exceptions.

          I can't think of many analogous interfaces here, the lack of which probably makes this argument more contentious. HTTP is vaguely similar: there's sometimes a full response, and sometimes there's a redirection. In most APIs wrapping HTTP that I've seen, you get a response object in response to a request. TCP/IP errors (socket timeouts, name resolution, etc.) throw exceptions, but nothing that has to do with HTTP does. One inspects response.status to see if it's 200 (ok), 301 (redirect), etc.
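          In Java terms the analogy is something like the following, using only java.net (a sketch, inside a method that can throw IOException):

            // Redirects come back as status codes on the response, not as exceptions.
            HttpURLConnection conn =
                (HttpURLConnection) new URL("http://example.com/").openConnection();
            conn.setInstanceFollowRedirects(false);   // inspect redirects ourselves
            int status = conn.getResponseCode();      // 200 = ok, 301 = redirect, ...
            if (status == HttpURLConnection.HTTP_MOVED_PERM) {
              String target = conn.getHeaderField("Location");  // the "link" to follow
            }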

          – Philip

          Eli Collins added a comment -

          +1 for not using exceptions.

          My experience is, and to quote Josh Bloch, "a well-designed API must not force its clients to use exceptions for ordinary control flow". Resolving links in HDFS paths, regardless of the component it's implemented in, will definitely be considered normal control flow (as opposed to eg something that causes an IOException).

          The case being made for exceptions in this jira is that the return type of the current interface doesn't allow for any case besides returning, eg, a block report; therefore exceptions are a low-impact way to propagate this new return type. This deficiency in the current API seems like a good reason for changing it, rather than a motivation for working around it using exceptions. Whether the resulting patch is too invasive for 21 is a very reasonable concern, but hopefully an orthogonal one.

          Eli Collins added a comment -

          Also, Sanjay made the good point several times above that FS and the NN shouldn't be inconsistent wrt their use of exceptions, ie UnresolvedPathException is not a good use of exceptions for the same reason--it's possible to write clean code that returns values via normal unwinding (ie if the NN were written in C, I don't think anyone would be arguing that we use setjmp/longjmp here).

          dhruba borthakur added a comment -

          I was of the opinion that using Exceptions keeps the code cleaner and easier to read. But you raise a good question: "what if HDFS were written in C, would we use setjmp/longjmp?" Definitely not. I am beginning to come around to the view that the code looks cleaner with Exceptions only because of the deficiencies of the current API.

          Also, is there any performance difference between throwing an exception vs returning a status code?

          Doug Cutting added a comment -

          > Also, is there any performance difference between throwing an exception vs returning a status code?

          Java exceptions are substantially slower than normal returns.

          I can't find a lot of benchmarks, but here's one that says it's 66x slower!

          http://stackoverflow.com/questions/299068/how-slow-are-java-exceptions/299315
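          The gap is easy to see with even a crude micro-benchmark (illustrative only; absolute numbers vary wildly with JIT warm-up):

            // Crude comparison of a normal return vs. throw/catch on the same path.
            public class ExceptionCost {
              static int viaReturn(int i) { return i; }
              static int viaThrow(int i) throws Exception { throw new Exception(); }

              public static void main(String[] args) {
                final int N = 1000000;
                long sum = 0;
                long t0 = System.nanoTime();
                for (int i = 0; i < N; i++) sum += viaReturn(i);
                long t1 = System.nanoTime();
                for (int i = 0; i < N; i++) {
                  try { sum += viaThrow(i); } catch (Exception e) { /* signal only */ }
                }
                long t2 = System.nanoTime();
                System.out.println("return: " + (t1 - t0) + " ns, throw: "
                    + (t2 - t1) + " ns (sum=" + sum + ")");
              }
            }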

          Philip Zeyliger added a comment -

          I'm not up to date on the current lore of exceptions performance, but Java Exceptions tend to carry around stack frames of where they were thrown (Throwable has a handful of fields including StackTraceElement[]), and those are certainly heavier than any return value object we might construct.
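          Worth noting that most of that cost is the stack-trace capture, which can be suppressed when an exception is deliberately used as a signal; this is a standard trick (whether the current patch does it, I haven't checked):

            // A "cheap" exception: skipping the stack-trace fill avoids the
            // expensive native stack walk, at the cost of diagnostic detail.
            public class CheapLinkException extends IOException {
              @Override
              public synchronized Throwable fillInStackTrace() {
                return this;
              }
            }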

          Konstantin Boudnik added a comment -

          I discussed the article from Doug's post with some folks from the Hotspot team, and what I heard is basically this:

          • using exceptions for control flow is a bad idea (doh!)
          • exceptions are still slow enough, although it's getting better because object creation is now relatively cheap

          Also, exceptions tend to carry a handful of information about the internal state of a system, which might get exposed to a devoted 'user'.

          Joydeep Sen Sarma added a comment -

          the left side of my brain says don't comment, don't comment ..

          the right side says the comparison to C is entirely inappropriate. There are so many places in Java (everywhere!) where exceptions are thrown where in C one would have a return code. Integer.parseInt and NumberFormatException versus atoi anyone? There may be other ways of reasoning about this - but not comparison to C ..

          there i did it. stupid right side again.

          Eli Collins added a comment -

          I mentioned C to say it's possible to write clean code that returns values via normal function call unwinding.

          Does anyone still think we should be using exceptions for normal control flow?

          Currently, for example, checkPermission returns void but throws AccessControlException and UnresolvedPathException. Both represent normal control flow (ie whether the user is using access control or symlinks). How about instead we return a value (eg similar to FsResult) to indicate the status of the operation? I realize that would be a sizable diff; it doesn't necessarily have to all go in with symlinks, and it would be great if we could stage it.
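          As a sketch of what that might look like (the enum and the helpers isLink()/permits() are invented for illustration):

            // Hypothetical status-carrying result in place of void-plus-exceptions.
            enum PermissionCheck { GRANTED, DENIED, UNRESOLVED_LINK }

            PermissionCheck checkPermission(INode inode, FsAction access) {
              if (inode.isLink()) {
                return PermissionCheck.UNRESOLVED_LINK;  // caller resolves, retries
              }
              return permits(inode, access) ? PermissionCheck.GRANTED
                                            : PermissionCheck.DENIED;
            }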

          Btw, I'll post a design doc soon which attempts to capture the jira so far, and will follow up with a test plan. I want to make sure there's consensus, since the discussion has spread over several months and there are still open questions.

          Thanks,
          Eli

          Tsz Wo Nicholas Sze added a comment -

          > Currently for example checkPermission returns void but throws AccessControlException ...

          Hi Eli, do you think that all the checkXxx methods in java.lang.SecurityManager are badly designed because they throw exceptions for normal control flow?

          Konstantin Shvachko added a comment -

          Allen asked for a summary. Here is one, before history takes a new turn in this issue. I have tried to be objective.

          Overview of the SymLink proposal.

          The SymLink proposal HDFS-245 introduces symbolic links in HDFS, which can point to files or directories in the same (internal link) or a different (external link) file system.

          1. Internal link example /users/shv/test
          2. External link example hdfs://nn1.yahoo.com:50030/users/shv/data
            External links may also be viewed as mount points of other Hadoop file systems, i.e. those that implement the FileSystem API.

          Internal links can be resolved inside the name-node; there is no controversy here.

          External links should be returned to the HDFS client, which should connect to the linked file system on a different cluster and get the information from there. The controversy here is how the information about external symbolic links should be returned to and handled by the client.
          Note that such handling can only be done at the top FileSystem level, because the lower-level abstractions represent a single Hadoop file system.

          The following major approaches have been considered.

          I. Replace each call to the name-node with two: the first resolves the input path if it is a symbolic link or does nothing if the path is local, the second performs the required action, like open() a file.
          II. Each name-node operation with a path parameter throws an UnresolvedLinkException if the path crosses an external link (a mount point). The exception contains a new path pointed to by the symlink, which the client uses to repeat the call with a different file system.
          III. Every call to the name-node returns a result, which incorporates a potential symlink resolution, if any, in addition to the actual return type of the method. Thus the result of a method is sort of a union of a potential link and the real type. See examples below.
          IV. Let the name-node throw MountPointException if the specified path contains an external symlink. The client catches MountPointException, calls a new method getLinkTarget() to resolve the link, and uses the resolved path further to perform the action - open() a file.
          V. Similar to (III) but the potential symlink resolution is hidden in a thread local variable.

          Most people agreed that (I) doubles the access time in the regular case, when there are no symlinks. Doug proposed at some point to benchmark it before ruling it out, because the name-node is fast based on the latest benchmarks.
          (II), (III) and (V) do not require extra RPC calls.
          (IV) does not require extra RPCs in the regular (no symlinks) case, but does require resolving external symlinks explicitly when one is on the path.

          The latest patch submitted by Dhruba implements (III). There were several iterations over this. The latest API looks like this:

          FsResult<LocatedBlocks> getBlockLocations(path, ...) throws IOException;
          FsResult<FileStatus> getFileInfo(String src) throws IOException;
          FsResult<BooleanWritable> delete(String src, boolean recursive) throws IOException;
          FsResult<NullWritable> setOwner(String src, ...) throws IOException;
          

          This is much better than the starting patch, which defined a separate union class for every return type, but still, replacing a simple void type with FsResult<NullWritable> and Boolean with FsResult<BooleanWritable> looks controversial.

          • Doug opposes (II) and (IV) on the basis that "Exceptions should not be used for normal program control."
          • He refers to "Best Practices for Exception Handling" (http://onjava.com/pub/a/onjava/2003/11/19/exceptions.html), which lists just three cases where exceptions are appropriate: programming errors, client code errors, and resource failures, which the links are clearly not.
          • I, Sanjay, et al. argue that this and other sources mention and accept the usage of so-called Checked exceptions, which "represent invalid conditions in areas outside the immediate control of the program". UnresolvedLinkException and MountPointException fall under this category, because the name-node, being unaware of other clusters, cannot immediately handle external symlinks.
          • Doug's counter argument to this is that the whole HDFS is a single monolithic layer, which should handle symlinks and therefore this is not an exceptional condition for any sub-layers of HDFS.
          Konstantin Shvachko added a comment -

          My opinion as I expressed it previously in this jira:

          • HDFS is not a single monolithic layer. It consists of several abstract layers: name-node, client, FileSystem.
            Most of them, up to the top one, do not have a clue what to do with a link to an external file system, and therefore should treat it as an exceptional condition and throw a Checked exception.
          • The current exception-less implementation substantially complicates (pollutes) the file system api through all of its layers.
            Speaking as a developer, supporting this code will be a tough effort. Understandability of the code suffers as well.
          • My personal preference is approach VI. I think Checked exception is exactly what should be used here, but passing any data inside the exception is a bad practice. Performance-wise, paying the price of an extra RPC call to the name-node ONLY when the path REALLY contains the external link is justifiable.

Konstantin Shvachko added a comment -

There was no explanation for the reassignment of this jira. Eli, are you committing to support this patch or do you see your mission to provide documentation and testing?

          Eli Collins added a comment -

I think well designed APIs don't require catching exceptions for normal control flow. My guess is that they chose to throw a SecurityException because they wanted to make failing permission checks impossible to ignore (you have to either catch or rethrow) whereas it's much easier to ignore a function return value, leading to a security flaw. Perhaps that's the right trade off in this case. Also perhaps continuing to throw an exception that is an AccessControlException is reasonable given that it's the convention for a library used extensively. I also don't consider java.lang to be the final word on what's best for a given system (just look how it's evolved); you have to consider the trade off. For example, if exceptions turn out to be pretty slow that may not be a big deal for SecurityException, since it's not thrown frequently in the common case, but not at all good for UnresolvedPathException, since it may be thrown heavily (eg if the root NN has nothing but symlinks to other NNs).

So in this case, and in general, I think the above is a good api design principle. There are real reasons why exceptions are not appropriate for normal control flow; the tradeoffs need to be considered for the given case. In this jira the only argument I read for throwing the exception is that it means not modifying a lot of function prototypes, which, to me, does not justify bending a good principle.

          Thanks,
          Eli

          Doug Cutting added a comment -

          > Doug's counter argument to this is that the whole HDFS is a single monolithic layer, which should handle symlinks and therefore this is not an exceptional condition for any sub-layers of HDFS.

That's true. But the point I feel more strongly about is that, in the SPI for FileSystem implementations (local, HDFS, S3, etc.), links are normal. If HDFS uses exceptions internally for this that's not great. (It actually can't easily use exceptions across RPCs anyway, since Hadoop's RPC mechanism has poor support for exceptions and should probably catch them server-side, bundle them into a union-like data structure, and re-throw them on the other side.) But for HDFS to communicate the presence of an external link to the generic FileSystem layer through an exception is much worse. The FileSystem API is arguably Hadoop's most central API and should be held to a very high standard. The contract of a FileSystem implementation, from the outside, in the presence of links, is to map paths to either bytes or to an external link. How it chooses to implement that internally is a separate issue. Exceptions are a poor choice at either level, but I have a much harder time tolerating them at the upper level.

          Eli Collins added a comment -

          Hey Konstantin,

          Thanks for posting the overview. Looks good, though it's not just Doug who would prefer not using exceptions for normal control flow. But aside from the issue of the use of exceptions is there consensus, eg on the user-visible behavior of the latest patches? If so, sorry, wasn't clear from the comments.

> There was no explanation for the reassignment of this jira. Eli, are you committing to support this
> patch or do you see your mission to provide documentation and testing?

Apologies for the missing context. Dhruba hasn't had the cycles to close the jira and I offered to help out. It wasn't clear if people were OK with the latest patches so I'll work on finishing them up so they're acceptable to people. Yes, I'm committed to supporting the feature.

          > My personal preference is approach VI.

          Do you mean approach V?

It seems the use of symbolic links will be a common case (eg you can imagine them being used a la google namespaces) and therefore it would be nice to avoid multiple round trips to the NN.

          Thanks,
          Eli

          Konstantin Shvachko added a comment -

          > My personal preference is approach

          I am in favor of approach IV (four, 4, the one with MountPointException).
          Thanks for correcting, Eli.

          Konstantin Shvachko added a comment -

          > The contract of a FileSystem implementation, from the outside, in the presence of links, is to map paths to either bytes or to an external link. How it chooses to implement that internally is a separate issue.

          If this means NameNode and DistributedFileSystem may throw an exception in case of paths with external links, which is handled on the FileSystem level, then that resolves the issue for me.

Doug Cutting added a comment -

My position is the same as I stated in: https://issues.apache.org/jira/browse/HDFS-245?focusedCommentId=12638334&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12638334

Konstantin Shvachko added a comment -

          Then it should be easy to propagate UnresolvedPathException from the patch to the FileSystem level?
          This will simplify the patch a lot.

          eric baldeschwieler added a comment -

          Hi Folks,

It seems folks' positions on this issue have not changed since the bug was first discussed many moons ago. Doug has objected to the use of exceptions since first proposed. Several committers have objected to the no-exception-based approach since it was first proposed.

          We seem to have no mechanism for resolving such disputes. It appears this feature has missed 21 as a result, which is a pity since everyone agrees that this would be a useful feature. Given the high degree of interest in the feature, I'd suggest that we can probably vote it into the 21 release in a week or two if we can get to a consensus on the design.

How do we get past the veto impasse?

          E14

          Sanjay Radia added a comment -

          Sorry for my delayed response on this Jira - I was occupied for the 0.21 release.

Exceptions should match the layer of abstraction. Often an exception from a lower layer is converted to another, more appropriate exception and passed up. Similarly, one often consumes an exception thrown by the layer below. For example, in a multi-device storage system a storage device may throw an exception that it does not have any free space, and a middle layer may consume that exception and reapply the operation at a different device.

I see the FileSystem layer as being independent from its implementations such as a KFS server or a NN server. Each abstraction layer should stand on its own and use exceptions appropriate for that layer. From the NN's perspective, symlinks that cannot be followed, such as those to KFS or to other NNs, are an abnormal condition. Hence from the NN's perspective it is reasonable to use exceptions to convey this abnormal condition, and the RPC interface (i.e. the NN's interface) can throw exceptions for foreign symbolic links. The FileSystem layer decides whether it can process such an exception or convert and re-throw it as a FileNotFoundException.

I have been very consistent in my comments earlier that my preference for exceptions is NOT because I am too lazy to use the alternate solution; use of exceptions is appropriate in this situation. Further, if the alternate solution did not make the interfaces so ugly, or if it simply affected one or two methods, I would have been able to accept the alternate solution.
Our current interface is reasonably clean - very similar to other file systems and hence familiar to a FS user and developer. My objection is that the other approach makes the interface unnatural and its implementation harder to maintain.
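
To illustrate the layering pattern Sanjay describes, here is a minimal sketch of a FileSystem-level method consuming the name-node's exception; UnresolvedLinkException and resolveOnLinkedFileSystem are illustrative names, not code from any patch here.

// Sketch only: the lower layer's exception is consumed at this layer,
// so callers of the FileSystem API never see it.
public FileStatus getFileStatus(String src) throws IOException {
  try {
    return namenode.getFileInfo(src);     // NN throws if an external link is crossed
  } catch (UnresolvedLinkException e) {
    // Either follow the link on the other file system, or convert and
    // re-throw an exception appropriate for this layer's own callers.
    return resolveOnLinkedFileSystem(src);
  }
}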

          Doug Cutting added a comment -

          Sanjay> Exceptions should match the layer of abstraction.

          I agree. The question is whether symbolic links are first-class data in the FileSystem SPI. If they are, then they ought to be returned as normal values, no?

          Konstantin> IV. Let the name-node throw MountPointException if the specified path contains an external symlink. The client catches MountPointException, calls a new method getLinkTarget() to resolve the link, and uses the resolved path further to perform the action - open() a file.

I could live with this approach. It's an optimization of (I). Clients could always instead call getStatus() first, resolving links and avoiding MountPointException by ensuring they only call open() on paths that are non-links. If we find that this optimization is insufficient then we could further optimize the API later to avoid a separate call for link resolution.

          Doug Cutting added a comment -

          > a new method getLinkTarget() to resolve the link

          Isn't that just getStatus()? The status should include whether the file is a link and, if it is, the target of the link, no?

          Konstantin Shvachko added a comment -

          Yes (IV) is an optimization of (I).
I first thought getStatus() is the same as getLinkTarget(), but maybe it's not, because we don't know which part of the path is a link.
          If the link is in the middle of the src path, then having getFileStatus(src) return a FileStatus of the link inside instead of the status of the intended file may be confusing.
          So the idea of having a separate getLinkTarget() is that it can return the target of the link along with the portion of the path, which points to the link.

          Sanjay Radia added a comment -

          Konstantin> IV. Let the name-node throw MountPointException if the specified path contains an external symlink. The client catches MountPointException, calls a new method getLinkTarget() to resolve the link, and uses the resolved path further to perform the action - open() a file.

          Doug> I could live with this approach.

          I can live with the MountPointException and 2nd call to get the link info.
Hard to believe that we did not find this middle ground earlier. So Doug, you are okay with using MountPointException as long as it does not return the symlink info inside the exception?

          Agree with Doug that getStatus() should suffice for the 2nd call.

          Doug Cutting added a comment -

          > we don't know which part of the path is a link.

          Under (IV), if /a/b is an external link to foo:/z, then getFileStatus(/a/b/c) would throw a MountPointException, while getLinkTarget(/a/b/c) would return foo:/z/c. Is that right?

          Doug Cutting added a comment -

Konstantin, will you help implement (IV) for 0.21? Dhruba's current patch implements (III). Eli volunteered to update that for the project split and add a design doc & tests, but converting it to (IV) is a larger task. Perhaps Eli is willing to attempt this too, I don't know, but he is less facile with HDFS than you, and such a substantial change at such a late date might be a stretch.

          Konstantin Shvachko added a comment -

          > ... getLinkTarget(/a/b/c) would return foo:/z/c. Is that right?

          Exactly.

          > Konstantin, will you help implement (IV) for 0.21?

I would be glad to implement this, but I am currently preoccupied with append. Maybe Dhruba wants to convert his patch to (IV); then append and symlinks can go in parallel. If not, I will be available at the beginning of October.

          dhruba borthakur added a comment -

Hi Doug/Eli, if you have already split up the existing patch into three parts (common, mapred, hdfs), can you please post them to the JIRA? I will then implement the latest changes proposed and post a new patch.

          Eli Collins added a comment -

The symlink15.txt patches rebased against trunk post project split. They have not been merged in the last couple of days so they have conflicts with recent FileContext changes.

          Eli Collins added a comment -

          Thanks Dhruba.

          Sanjay Radia added a comment -

Symlinks to .. may be tricky to make work as we fold dotDots when a path is constructed, i.e. Path("foo/bar/../xxx") becomes "foo/xxx".
          Good news is that initial dotDot (as in "../foo/bar") is left alone.
          Dhruba, you may want to watch out for this case in your code. DFSUtil#isValidPath() checks for dotDots as being invalid.
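
To illustrate the folding Sanjay describes (assuming Path normalization behaves as stated above):

new Path("foo/bar/../xxx");  // folded to "foo/xxx" when the Path is constructed
new Path("../foo/bar");      // the leading ".." is left alone
// A link target containing an embedded ".." would therefore be rewritten
// before the target string ever reaches the name-node.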

          Eli Collins added a comment -

          Uploaded a first draft of a design doc (followed the template in HADOOP-5587). Content follows. Lemme know where I missed the boat. Thanks, Eli

          Symlinks Design Doc

          Problem definition

          HDFS path resolution has the following limitations:

          • Files and directories are only accessible via a single path.
          • An HDFS namespace may not span multiple file systems.

Symbolic links address these limitations by providing an additional level of indirection when resolving paths in an HDFS file system. A symbolic link is a special type of file that contains a path to another file or directory. Paths may be relative or absolute. Relative paths (eg ../user) provide an alternate path to a single file or directory in the file system. An absolute path may be relative to the current file system (eg /user) or specify a URL (eg hdfs://localhost:8020/foo) which allows the link to point to any file or directory irrespective of the source and destination file systems.

Allowing multiple paths to resolve to the same file or directory, and HDFS namespaces to span multiple file systems, makes it easier to access files and manage their underlying storage.

          Use Cases

If an application requires data be available by a particular path a symlink may be used in lieu of copying the data from its current location. For example, a user may want to create a symlink /data/latest that points to an existing directory so that the latest data is accessible via its current name and an alias, eg:

          $ hadoop fs -ln /data/20090922 /data/latest

          The user may eventually want to archive this data so that it's accessible but stored more efficiently. They could create an archive of the files in the 20090922 directory and make the original path a symlink to the HAR, eg:

          $ hadoop fs -ln har:///data/20090922.har /data/20090922

          They could also move the directory to another file system that is perhaps lightly loaded or less expensive and make the existing directory a symlink, eg:

          $ hadoop fs -ln hdfs://archive-host/data/20090922 /data/20090922

          The archival file system could also be accessible via an alternative protocol (eg FTP). In both cases the original data has moved but remains accessible by its original path.

This technique can be used generally to balance storage by transparently making a namespace span multiple file systems. For example, if a particular subtree of a namespace outgrows the capabilities of the file system it resides on (eg Namenode performance, number of files, etc) it can be moved to a new file system and linked into its current path.

A symbolic link is also a useful primitive that could be used to implement atomic rename within a file system by atomically rewriting a symbolic link, or to rename a file across partitions (if in the future a file system's metadata is partitioned across multiple hosts). See HADOOP-6240 for more info.

          Interaction with the Current System

          The user may interact with symbolic links via the shell or indirectly via applications (eg libhdfs clients like fuse mounts).

Symbolic links are transparent to most operations though some may want to handle links specially. In general, the behavior should match POSIX where appropriate. Note that linking across file systems is somewhat equivalent to creating a symlink across mount points in POSIX.

          Path resolution:

• Some commands (eg stat, rm) operate on the link itself when the link is the command's argument.
• Some commands (eg mv) should operate on the link target if a trailing slash is used (eg if /bar is a link that points to a directory, mv /foo /bar renames /foo to /bar, replacing the link, while mv /foo /bar/ moves /foo into the directory pointed to by /bar).
          • Symbolic links in archive URIs should fully resolve.
          • Some APIs should operate on the link target (eg setting access and modification times).

          Permissions: access control properties of links are ignored, checks are always performed against the link target (if it resides in an HDFS file system). The ch* operations should operate directly on the target.

          Some utilities need to be link-aware:

          • distcp should not follow links by default.
          • fsck should only look at each file once, could optionally report dangling links.
          • Symbolic links in HARs should be followed (so that a symlink to a HAR preserves the original path resolution behavior).

          Symbolic links exist independently of their targets, and may point to non-existing files or directories.

          Requirements

          Clients send the entire path to the Namenode for resolution. The path may contain multiple links:

          • If all links in the path are relative to the current file system then the Namenode transparently (to the client) resolves the path.
• If the Namenode finds a link in the path that points outside the file system it must provide API(s) to (a) report to the client that it cannot resolve the path (due to a link that points outside the file system) and (b) return the target of the link and the remainder of the path that still needs to be resolved.

          Symbolic links should be largely invisible to users of the client. However, symbolic links may introduce cycles into path resolution. For example, a link may point to another URI (on another file system) which points back to the link. Loops should be avoided by having the client limit the number of links it will traverse, and report to its user that the operation was not successful.

          Symbolic links should not introduce significant overhead in the common case, resolving paths without links. Resolving symbolic links may be a frequent operation. If links are being used to transparently span a namespace across multiple file systems then the "root" Namenode may do little work aside from resolving link paths. Therefore, link resolution should have reasonable performance overhead and limited side-effects on both the client and Namenode.

          Design

One approach is to have the client first stat the file to see if it is a symbolic link and resolve the path. Once the path is fully resolved (this may require contacting additional file systems) the client performs the desired operation. This introduces an additional RPC in the common case (when links are not present), so an optimization is to optimistically perform the operation first. If it was unsuccessful due to the presence of an external link the Namenode notifies the client, using an exception, which is caught at the FileSystem level (since the link refers to another file system). The client then makes an additional call to the Namenode to resolve the link.* Once the path is fully resolved the operation can be re-tried. Note that if the operation contained multiple paths, each may contain links which must be fully resolved before the operation can complete. This may require additional round trips. The call to resolve a link may fail if the link has been deleted, or is no longer a link, etc. In this case the resulting exception is passed upwards as if the original operation failed. As stated above, link resolution at the FileSystem level will perform a limited number of link resolutions before notifying the client of the failure.
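
A minimal sketch of the flow just described; MAX_LINK_HOPS, the exception type, and the helper names are placeholders, not the patch's actual identifiers.

// Sketch only: optimistic call first, bounded link resolution on failure.
public FileStatus getFileStatus(Path p) throws IOException {
  for (int hops = 0; hops < MAX_LINK_HOPS; hops++) {
    try {
      return getFileSystem(p).getFileInfo(p);   // common case: a single RPC
    } catch (UnresolvedLinkException e) {
      p = getFileSystem(p).getLinkTarget(p);    // extra RPC only when a link was hit
    }
  }
  throw new IOException("Too many symbolic links resolving " + p);
}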

The Namenode's INodeFile class needs to maintain additional metadata to indicate whether a file is a symbolic link (or an additional class could be introduced).
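
One possible shape for that metadata (illustrative only; the patch may equally use a flag and a target field on INodeFile itself):

// Sketch of a dedicated inode class for symbolic links.
class INodeSymlink extends INode {
  private final byte[] target;   // the link target path, stored as bytes
  INodeSymlink(byte[] target) { this.target = target; }
  byte[] getTarget() { return target; }
  boolean isLink() { return true; }
}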

* NB: It is possible to eliminate this additional RPC by piggybacking the link resolution on the notification.

          Future work

This design should cover the necessary use cases; however, some of the above features (like enhancing archives) may be deferred.

          Related future work is implementing atomic rename within a file system by atomically re-writing a symbolic link.

          Nigel Daley added a comment -

          Thanks Eli. The design states:

> Loops should be avoided by having the client limit the number of links it will traverse

          What about loops within a filesystem? Does the NN also limit the number of links it will traverse?

You give examples of commands that operate on the link, examples that operate on the link target, and examples of those that depend on a trailing slash. Given this is a design, can you be explicit and enumerate the commands for each of these? For instance, setReplication, setTimes, du, etc.

What is the new option for fsck to report dangling links? What does the output look like?

          What is the new option for distcp to follow symlinks?

          If distcp doesn't follow symlinks, I assume it just copies the symlink. In this case, is the symlink adjusted to point to the source location on the source FS?

          What does the ls output look like for a symlink?

          Do symlinks contribute bytes toward a quota?

          dhruba borthakur added a comment -

          Here is a patch that conforms to the modified consensus arrived earlier in the week. I would like everybody to review the symlink17-common.txt patch that has changes to the FileSystem API.

          This code is not yet tested in great detail and is uploaded only for review purposes. (Also, this patch does not yet modify the FileContext API.)

          Doug Cutting added a comment -

          A few comments on the common patch:

• If we're really on the verge of deprecating FileSystem (HADOOP-6223) then perhaps we should not invest effort adding new features there. We could instead mostly leave FileSystem alone. Symlinks would only be implemented for applications that use the FileContext API (which might help encourage folks to move to that API); a sketch of what that might look like follows this list. Until HADOOP-6223 is committed, existing FileSystem implementations can be modified to throw MountPointException, and we'd still need to add the getLinkTarget() method, but I don't think we'd need to make many other changes to FileSystem. This also reduces risk, since existing application code paths are altered minimally. What do folks think?
          • maxPathLinks should probably be less than 1000. Earlier I suggested this should be somewhere between 8 (POSIX minimum) and 32 and perhaps configurable.
• MountPointException would be better named something like UnresolvedLinkException. We don't elsewhere use the term "mount point" and there's no need to introduce it here.
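
A rough sketch of what a FileContext-only surface could look like; the method names below are assumptions modeled on the FileContext work in HADOOP-6421, not a committed API, and the usual org.apache.hadoop.fs imports are assumed.

// Hypothetical FileContext-based usage; names are illustrative.
FileContext fc = FileContext.getFileContext();
fc.createSymlink(new Path("/data/20090922"), new Path("/data/latest"), false);
FileStatus stat = fc.getFileLinkStatus(new Path("/data/latest"));  // lstat-like: does not follow
if (stat.isSymlink()) {
  Path target = fc.getLinkTarget(new Path("/data/latest"));        // read the link target
}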

Allen Wittenauer added a comment -

> What is the new option for fsck to report dangling links? What does the output look like?

Why should fsck report this at all? Since symlinks take the form of a URL, fsck would need to make call-outs to a potentially remote system to verify them. Over a large file system with many symlinks, this could be a tremendous amount of network bandwidth. IIRC, no other OS has a "report all broken symlinks" functionality. The reality is that very few symlinks are "important" enough to have that much scrutiny. Instead, there is a reliance upon savvy admins to write scripts using find to locate symlinks and verify. [But since we have no find ... ]

> access control properties of links are ignored, checks are always performed against the link target (if it resides in an HDFS file system). The ch* operations should operate directly on the target.

          I disagree. We need chown -h (POSIX) and chown -H, chown -L, and chown -P (SVID?) for the same reasons that those OSes support them.

          We also need test -h and test -L supported.

          Robert Chansler added a comment -

A further thought (perhaps implied by Allen's remarks) is that there needs to be a specification of Day Zero behavior for shell commands. I'm not opposed to adding options separately. But on Day Zero, are links in command arguments always followed? Does lsr follow links? If so, how does lsr avoid loops? rmr? And what about permissions for links? What is the moral equivalent of lstat(2)? (From the lstat(2) man page: "Unlike other filesystem objects, symbolic links do not have an owner, group, access mode, times, etc. Instead, these attributes are taken from the directory that contains the link. The only attributes returned from an lstat() that refer to the symbolic link itself are the file type (S_IFLNK), size, blocks, and link count (always 1).")

          A link is a name, and so should count against the name quota, but not against the space quota.

          fsck should never follow links. Nor is there any meaningful sense to whether a link is "broken." When resolution is attempted, a link either resolves or it doesn't.

          Doug Cutting added a comment -

          > But on Day Zero, are links in command arguments always followed?

          If we follow my suggestion of not implementing links except through the FileContext API, then, on Day Zero, existing code, none of which yet uses FileContext, would not follow links. Symbolic links accessed via the old API could simply appear as empty files: they'd have zero length and an empty blocklist. Then, as we update tools to use the FileContext API, we can decide how each should process symbolic links. Would that help?

          Eli Collins added a comment -

Thanks Nigel, Allen, Robert for taking a look. Will update the doc with feedback.

Good point, the NN should do the same, use the same limit for consistency.

Agree with Allen. Updated to read "fsck should not follow or report dangling links".

Probably don't need one, seems like distcp should follow links and...

would not adjust links when it's copying them.

Not sure the design doc needs to specify, but how about the usual fileName -> path?

Agree w/ Rob, links should not count towards quota (they have no storage after all).

You're right, the above should be chmod (not ch*). And checks against the link target should be performed regardless of what file system it resides on. And yes, ch* utilities need to be link-aware.

Agreed. Seems like these should all go in the user guide, unless people want them spelled out in the design doc.

Wrt lsr and rmr, I'd make them not follow links by default and add options that do (can detect loops by remembering the paths they've visited).

+1 to Doug's idea of exposing links solely via FileContext, so that day zero behavior ignores links and clients get updated to be link aware as necessary. Are people OK updating them as they move to FileContext?

          Thanks,
          Eli

          Eli Collins added a comment -

          Apologies for the dupe, looks like jira stripped out the quoted text in my email above.

          Thanks Nigel, Allen, Robert for taking a look. Will update the doc with feedback.

          Loops should be avoided by having the client limit the number of links it will traverse What about loops within a filesystem? Does the