Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      We'd like to add a new hardlink feature to HDFS that allows hardlinked files to share data without copying. Initially we will support hardlinking only closed files, but it could be extended to unclosed files as well.

      Among the many potential use cases for this feature, the following two are the primary ones at Facebook:
      1. It provides a lightweight way for applications like HBase to create a snapshot;
      2. It also allows an application like Hive to move a table to a different directory without breaking currently running Hive queries.

      Attachments

      1. HDFS-HardLink.pdf (1.03 MB) - Liyin Tang

        Issue Links

          Activity

          Michael Segel added a comment -

          Is this still active? The last entry seems to be on Sept 12th... has there been any progress?
          While the underlying issue is HDFS, it seems that this is more about HBase than HDFS.

          One comment... with respect to hard links over multiple name spaces... why?

          I mean hardlinks should exist only within the same namespace, which would remove this roadblock.

          If you were going between namespaces, then use a symbolic link.

          Thx...

          Jesse Yates added a comment -

          @Jagane - in short, yes. With the PIT split, any writes up to that point will go into the snapshot. Obviously, we can't ensure that future writes beyond the taking of the snapshot end up in the snapshot. Some writes can get dropped between snapshots, though, if you don't have your TTLs set correctly, since a compaction can age-off the writes before the snapshot can be taken. This is part of an overall backup solution, and not really the concern of the mechanism for taking snapshots - that's up to you. Feel free to DM me if you want to chat more.

          Jagane Sundar added a comment -

          Thanks for the pointer to HBASE-6055, Jesse. I just skimmed it, but it is an excellent write-up you have there. Your rationale for the use of HBase Timestamps versus an actual Point in Time is well taken. My own experience is with writing software to back up a single running VM, so my previous comment did talk about an actual PIT.

          I did not catch this in my skimming of HBASE-6055, so maybe you can clarify - when using HBASE Timestamps to create backups, can we guarantee that the next backup will include all PUTs that were made after the previous snapshot? No PUTs will fall through the cracks, right?

          Jesse Yates added a comment -

          @Jagane - with HBASE-6055 (currently in review) you get a flush (more or less coordinated between regionservers - see the jira for more info) of the memstore to HFiles, which we would then love to hardlink into the snapshot directory. HFiles live under the region directory - which lives under the column family and table directories - where the HFile is being served. When a compaction occurs, the file is moved to the .archive directory. Currently, we are getting around the hardlink issue by referencing the HFiles by name and then using a FileLink (also in review) to deal with the file getting archived out from under us when we restore the table.
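          To illustrate the fallback described above (a sketch only - this is not the actual HBase FileLink class; the names and paths below are hypothetical), a reader could try the live region location first and fall back to the archive location if a compaction has moved the HFile:

{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical sketch of a FileLink-style reader: the HFile is referenced
 * by name, and if a compaction archives it out from under us, the read is
 * retried against the archive location instead of failing.
 */
public class ArchiveAwareReader {
  private final FileSystem fs;
  private final Path livePath;     // e.g. the serving region/column-family path
  private final Path archivePath;  // e.g. the corresponding path under .archive

  public ArchiveAwareReader(FileSystem fs, Path livePath, Path archivePath) {
    this.fs = fs;
    this.livePath = livePath;
    this.archivePath = archivePath;
  }

  /** Open the file, preferring the live location but tolerating archival. */
  public FSDataInputStream open() throws IOException {
    try {
      return fs.open(livePath);
    } catch (FileNotFoundException moved) {
      // The file was archived between the name lookup and the open; follow it.
      return fs.open(archivePath);
    }
  }
}
{code}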

          The current implementation of snapshots in HBase is pretty close to what you are proposing (and almost identical for 'globally consistent' - that is, cross-server consistent - snapshots, but those quiesce for far too long to ensure consistency), but spends minimal time blocking.

          In short, hardlinks make snapshotting easier, but we still need both parts to get 'clean' restores. Otherwise, we need to do a WAL replay from the COW version of the WAL to get back in-memory state.

          Does that make sense/answer your question?

          Jagane Sundar added a comment -

          Pardon my naive question - but, are hard links adequate for the purposes of HBase backup? The first line in this JIRA says "This provides a lightweight way for applications like hbase to create a snapshot".

          Perhaps HBase experts can answer this question: Are single file hard links adequate for HBase backup? Don't you want a Point In Time snapshot of the entire filesystem, or at least all the files under the HBase data directory?

          Don't you really want a sequence of events such as:
          1. Flush all HBase MemStores
          2. Quiesce HBase, i.e. get it to stop writing to HDFS
          3. Call underlying HDFS to create PIT RO Snapshot with COW Semantics
          4. Tell HBase to end quiesce, i.e. it can start writing to HDFS again
          5. Backup program now reads from RO snapshot and writes to backup device, while HBase continues to write to the real directory tree
          6. When the backup program is done, it deletes the RO snapshot
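
          As a rough sketch of how a backup driver might string those six steps together (the snapshot and quiesce APIs below are placeholders - neither a PIT/COW snapshot call in HDFS nor a quiesce hook in HBase exists in this proposal):

{code:java}
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical sketch of the backup sequence above. HBaseQuiesce and
 * HdfsSnapshots are placeholder interfaces, not existing APIs.
 */
public class PitBackupDriver {

  interface HBaseQuiesce {
    void flushAllMemStores() throws Exception; // step 1
    void stopWritesToHdfs() throws Exception;  // step 2
    void resumeWrites() throws Exception;      // step 4
  }

  interface HdfsSnapshots {
    Path createReadOnlySnapshot(Path root) throws Exception; // step 3 (PIT, COW)
    void deleteSnapshot(Path snapshot) throws Exception;     // step 6
  }

  interface BackupSink {
    void copyTree(Path source) throws Exception;             // step 5
  }

  public void backup(HBaseQuiesce hbase, HdfsSnapshots hdfs, BackupSink sink,
                     Path hbaseRoot) throws Exception {
    hbase.flushAllMemStores();
    hbase.stopWritesToHdfs();
    Path snapshot;
    try {
      snapshot = hdfs.createReadOnlySnapshot(hbaseRoot);
    } finally {
      hbase.resumeWrites(); // keep the quiesce window as short as possible
    }
    try {
      sink.copyTree(snapshot); // HBase keeps writing to the real tree meanwhile
    } finally {
      hdfs.deleteSnapshot(snapshot);
    }
  }
}
{code}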

          Sanjay Radia added a comment -

          >Is there any reason to allow cross-namespace hardlinks? Why not just return EXDEV or equivalent?...
          I agree. Unix does not allow hardlinks across volumes.

          Konstantin Shvachko added a comment -

          > to leverage ZooKeeper

          Correct. With ZK you get all the necessary coordination of the distributed updates. Plus you can store ref counts in ZNodes - no need for special inodes.
          In the end, hardlinks are not the goal in themselves, but a tool for implementing e.g. HBase snapshots.
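
          As a rough illustration of the ZooKeeper idea (a sketch under assumed conventions - the znode layout and paths are not an agreed design), the ref count for a linked file could live in a znode and be bumped with a versioned, optimistic update:

{code:java}
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

/** Sketch: per-file ref counts kept in ZooKeeper instead of special inodes. */
public class ZkRefCounter {
  private final ZooKeeper zk;

  public ZkRefCounter(ZooKeeper zk) {
    this.zk = zk;
  }

  /** Create the counter znode for a newly hardlinked file with count = 1. */
  public void register(String znodePath) throws KeeperException, InterruptedException {
    zk.create(znodePath, "1".getBytes(StandardCharsets.UTF_8),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
  }

  /** Atomically add delta to the count; retries on concurrent updates. */
  public int addAndGet(String znodePath, int delta)
      throws KeeperException, InterruptedException {
    while (true) {
      Stat stat = new Stat();
      byte[] data = zk.getData(znodePath, false, stat);
      int next = Integer.parseInt(new String(data, StandardCharsets.UTF_8)) + delta;
      try {
        // Versioned write: fails with BadVersion if someone else updated concurrently.
        zk.setData(znodePath, String.valueOf(next).getBytes(StandardCharsets.UTF_8),
            stat.getVersion());
        return next; // when this reaches 0, the owner can reclaim the blocks
      } catch (KeeperException.BadVersionException retry) {
        // lost the race; re-read and try again
      }
    }
  }
}
{code}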

          Andy Isaacson added a comment -

          you keep the data along with the file (including the current file owner), you could do it all from a library.

          The fundamental problem with "do it in a client library" is that there are always clients who do not or cannot use the library. Then the symlinks don't work for those clients. I think it's pretty clear that "open(2) transparently follows symlinks unless you ask it not to" is a superior developer experience to "open the file, is it a .LNK? then follow the link, else return". Even if there's a helper implementing the latter.

          Jesse Yates added a comment -

          Another, simpler way to do hardlinks with cross-server coordination (which in reality needs something like Paxos, or suffers some more unavailability to ensure consistency) would be to leverage ZooKeeper. Yes, -1 for another piece of infrastructure from this, but it does provide all the cross-namespace transactionality we need and makes reference counting and security management significantly easier. Not quite client-library easy, but pretty darn close.

          Jesse Yates added a comment -

          Sorry for the slow reply, been a bit busy of late...
          @Daryn

          Retaining ref-counted paths after deletion in the origin namespace requires an "inode id". A new api to reference paths based on the id is required. We aren't so soft anymore...

          That's why I'd argue for doing it in file meta-data with periodic rewrites so we can just do appends. We will still need to maintain references if we do hardlinks, so this is just a single method call to do the update - arguably a pretty simple code path that doesn't need to be that highly optimized for multi-writers since we can argue that hardlinks are "rare".

          The inode id needs to be secured since it bypasses all parent dir permissions,

          Yeah, that's a bit of a pain... Maybe a bit more metadata to store with the file...?

          @Konstantin

          Do I understand correctly that your hidden inodes can be regular HDFS files, and that then the whole implementation can be done on top of existing HDFS, as a stand alone library supporting calls

          Yeah, I guess that's a possibility. But you would probably need to have some sort of "namespace managers" to handle hardlinks across different namespaces, which fits comfortably with the distributed namenode design.

          ref-counted links, creating hidden "only accessible to the namenode" inodes, leases on arbitrated NN ownership, retention of deleted files with non-zero ref count, etc. Those aren't client-side operations.

          Since you keep the data along with the file (including the current file owner), you could do it all from a library. However, since the lease needs to be periodically regained, you will see temporary unavailability of the hardlinked files in the managed namespace. If you couple the hardlink management with the namenode managing the space, you can then do a forced reassignment of the hardlinks to the backup namenode and still see the same availability as you would for files in that namespace, in terms of creating new hardlinks (reads would still work since all the important data can be replicated across the different namespaces).

          @Andy: I don't know if I've seen a compelling reason that we need to have cross-namespace hardlinks, particularly since they are hard, to say the least.

          Andy Isaacson added a comment -

          The Windows ".lnk" file scheme is a pretty awful disaster, I hope that we don't implement a similar scheme in HDFS. I don't know of an example of a client-side shortcut scheme that worked out well (though I'd be interested to hear of any examples).

          Is there any reason to allow cross-namespace hardlinks? Why not just return EXDEV or equivalent? As an even more restrictive example, AFS only permits hardlinks within a single directory (not even between subdirectories).

          So long as failures are clearly communicated, it seems to me that it's OK to have a pretty restrictive implementation.

          Daryn Sharp added a comment -

          Jesse describes NNs proxying requests to each other to create and manage the ref-counted links, creating hidden "only accessible to the namenode" inodes, leases on arbitrated NN ownership, retention of deleted files with non-zero ref count, etc. Those aren't client-side operations.

          "Hardlinks" cannot be implemented with a client library. The best you can hope for on the client-side is managed symlinks that are advisory in nature. Clients not using the library will ruin the scheme.

          Konstantin Shvachko added a comment -

          Jesse, thanks for the detailed proposal. It totally addresses the complexity of issues related to hard links implementation in distributed environment.
          Do I understand correctly that your hidden inodes can be regular HDFS files, and that the whole implementation can then be done on top of existing HDFS, as a stand-alone library supporting calls like createHardLink() and deleteHardLink()? Applications would then use these methods if they want the functionality.
          Just trying to answer Sanjay's questions using your design as an example.

          Daryn Sharp added a comment -

          Nice idea, but I think it gets much more complicated. Retaining ref-counted paths after deletion in the origin namespace requires an "inode id". A new api to reference paths based on the id is required. We aren't so soft anymore...

          The inode id needs to be secured since it bypasses all parent dir permissions, yet the id should be identical for all links in order for copy utils to distinguish identical inodes.

          Now comes the worst part: the client. Will the NNs proxy fs stream operations to each other with a secure api for referencing inode ids? Or will they redirect the client to the origin NN? If they redirect, how to protect against the client guessing ids, or saving them for later replay even when the dir privs prevent access?

          Jesse Yates added a comment -

          I'd like to propose an alternative to 'real' hardlinks: "reference-counted soft-links", or all the hardness you really need in a distributed FS.

          In this implementation of "hard" links, I would propose that the NN where the file is created is considered the "owner" of that file. Initially, when created, the file has a reference count of (1) in the local namespace. If you want another hardlink to the file in the same namespace, you then talk to the NN and request another handle to that file, which implicitly updates the references to the file. The reference to that file could be stored in memory (and journaled) or written as part of the file metadata (more on that later, but let's ignore that for the moment).

          Suppose instead that you are in a separate namespace and want a hardlink to the file in the original namespace. Then you would make a request to your NN (NNa) for a hardlink. Since NNa doesn't own the file you want to reference, it makes a hardlink request to the NN which originally created the file, the file 'owner' (NNb). NNb then says 'Cool, I've got your request' and increments the ref-count for the file. NNa can then grant your request and give you a link to that file. The failure cases here are:
          1) NNb goes down, in which case you can just keep around the reference requests and batch them when NNb comes back up.
          2) NNa goes down mid-request - if NNa doesn't receive an ACK back for the granted request, it can then disregard that request and re-decrement the count for that hardlink.

          Deleting the hardlink then follows a similar process. You issue a request to the owner NN, either directly from the client if you are deleting a link in the current namespace or through a proxy NN to the original namenode. It then decrements the reference count on the file and allows the deletion of the link. If the reference count ever hits 0, then the NN also deletes the file since there are no valid references to that file.
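
          A toy sketch of the owner-side bookkeeping described above (hypothetical class and method names; journaling, cross-NN proxying, and the ACK/retry failure handling are deliberately omitted):

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Sketch of the owner NN's ref-count bookkeeping for "soft" hardlinks.
 * Journaling, cross-namespace proxying, and crash/ACK handling are omitted.
 */
public class HardlinkOwner {
  private final ConcurrentHashMap<String, AtomicInteger> refCounts =
      new ConcurrentHashMap<>();

  /** File created in this namespace: it starts with one reference. */
  public void fileCreated(String fileId) {
    refCounts.put(fileId, new AtomicInteger(1));
  }

  /** Link request, local or proxied from another NN; bump the count. */
  public void link(String fileId) {
    AtomicInteger count = refCounts.get(fileId);
    if (count == null) {
      throw new IllegalStateException("unknown file: " + fileId);
    }
    count.incrementAndGet();
  }

  /** Unlink request; returns true when the count hits 0 and the file can go. */
  public boolean unlink(String fileId) {
    AtomicInteger count = refCounts.get(fileId);
    if (count == null) {
      return false;
    }
    if (count.decrementAndGet() == 0) {
      refCounts.remove(fileId);
      return true; // caller reclaims the blocks
    }
    return false;
  }
}
{code}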

          This has the implicit implication though that the file will not be visible in the namespace that created it if all the hardlinks to it are removed. This means it essentially becomes a 'hidden' inode. We could, in the future, also work out a mechanism to transfer the hidden inode to a NN that has valid references to it (maybe via a gossip-style protocol), but that would be out of the current scope.

          There are some implications for this model. If the owner NN manages the ref-count in memory and that NN goes down, its whole namespace becomes inaccessible, including creating new hardlinks to any of the files (inodes) that it owns. However, the owner NN going down doesn't preclude the other NNs from serving the file from their own 'soft' inodes.

          Alternatively, the NN could have a lock on a hardlinked file, with the ref-counts and ownership info in the file metadata. This might introduce some overhead when creating new hardlinks (you need to reopen and modify the block or write a new block with the new information periodically - the latter actually opens a route to do ref-count management via appends to a file-ref file), but has the added advantage that if the owner NN crashed, an alternative NN could come and claim ownership of that file. This is similar to doing Paxos-style leader election for a given hardlinked file, combined with leader leases. However, this is very unlikely to see lots of fluctuation, as the leader can just reclaim the leader token via appends to the file-owner file, with periodic rewrites to minimize file size.

          The on-disk representation of the extreme version I'm proposing is then this: the full file is actually composed of three pieces: (1) the actual data, and then two metadata files, "extents" (to add a new word/definition): (2) an external-reference extent - each time a reference is made to the file a new count is appended, and it can periodically be recompacted to a single value; and (3) an owner-extent with the current NN owner and the lease time on the file, dictating who controls overall deletion of the file (since ref counts are done via the external-ref file). This means (2) and (3) are hidden inodes, only accessible to the namenode. We can minimize overhead to these file extents by ensuring a single writer via messaging to the owner NN (as specified by the owner-file), though this is not strictly necessary.

          Further, (1) could become a hidden inode if all the local namespace references are removed, but it could eventually be transferred over to another NN shard (namespace) to keep overhead at a minimum, though (again), this is not a strict necessity.

          The design retains the NN view of files as directory entries, just entries with a little bit of metadata. The metadata could be in memory or part of the file and periodically modified, but that’s more implementation detail than anything (as mentioned above).

          Sanjay Radia added a comment -

          Konstantine

          • How can one implement hard links in a library? If you have an alternate library implementation in mind please explain.
          • I am fine to have hard links and renames restricted to volumes; this should then give you the freedom to implement a distributed NN.
          Konstantin Shvachko added a comment -

          Sanjay, you are taking a quote out of context. It has been explained what "hard" means above. Please scan through. One more example:
          Well understood why traditional hard links are not allowed across volumes. A distributed namespace is like dynamically changing volumes. You can restrict a link to a single volume, but the names can flow to different volumes later on.

          I am not proposing to remove the existing complexity from the system; I propose not to introduce more of it. In the distributed case, consistent hard links need Paxos-like algorithms. They are not "elementary operations", which are the only things that should compose the API.
          Hard links can be implemented as a library using ZK, which will hold up in the distributed case.

          A couple of quotes from Sanjay's (mine too) favorite author:

          • When in doubt, leave it out. If there is a fundamental theorem of API design, this is it. You can always add things later, but you can't take them away.
          • APIs must coexist peacefully with the platform, so do what is customary. It is almost always wrong to transliterate an API from one platform to another.
          • Consider the performance consequences of API design decisions ...
          Daryn Sharp added a comment -

          @Sanjay: Good point about simply recording the length. It would eschew random-write (not proposing it, only mentioning it since it was cited earlier), but a feature like that would require significant other changes, so integration with hardlinks could be deferred until if/when that's implemented. If snapshots are implemented using COW-hardlinks, then we should consider duplicating the inode to preserve all metadata, i.e. not just the length at the time of the snapshot.

          Sanjay Radia added a comment -

          We should consider two kinds of hard-links: normal and COW. COW-HardLinks are easy since HDFS only allows append and hence one needs to simply record the length.
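
          To make "simply record the length" concrete (a sketch only, not a proposed API - the wrapper below just clamps reads at the length captured when the COW-hardlink was taken):

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;

/**
 * Sketch: a COW-hardlink over an append-only file only needs the length at
 * link time; reads through the link never see bytes appended afterwards.
 */
public class LengthBoundedReader {
  private final FSDataInputStream in;
  private final long snapshotLength; // length recorded when the link was made

  public LengthBoundedReader(FSDataInputStream in, long snapshotLength) {
    this.in = in;
    this.snapshotLength = snapshotLength;
  }

  public int read(byte[] buf, int off, int len) throws IOException {
    long remaining = snapshotLength - in.getPos();
    if (remaining <= 0) {
      return -1; // past the recorded length: later appends are invisible
    }
    return in.read(buf, off, (int) Math.min(len, remaining));
  }
}
{code}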

          Sanjay Radia added a comment -

          > ... hard links .. are very hard to support when the namespace is distributed
          There are many things that are hard in a distributed namenode. For example, rename is also hard - I recall discussing the challenges of renames in a distributed NN with Konstantine. Do we remove such things from HDFS?

          Tsz Wo Nicholas Sze added a comment -

          From the discussion, it seems that the notion of immutable files is useful for hardlinks and the HBase use case. I think it is not hard to support immutable files in HDFS; see also HDFS-3154.

          Jesse Yates added a comment -

          Maybe I'm missing something here...

          Backup itself only becomes safe if HDFS (not HBase) promises to never modify a file once it is closed. Otherwise, a process that accidentally writes into the hard-linked file will corrupt "both" copies

          At least for the HBase case, if we set the file permissions to 744, you will only have an HBase process that could mess up the file (which it won't do once we close the file), and then an errant process can only slow down other reader processes. That would make it sufficient at least for HBase backups, but clearly not for general HDFS backups.

          Daryn Sharp added a comment -

          I understand hardlinks likely aren't meant to be. However I'd like to point out:

          • Hardlinks cannot be implemented at a library level. The n-many directory entries must be able to reference the same inode, which unlike symlinks, are not bound by the permissions used to access any other of the paths to the hardlink. Filesystem support is required.
          • Hardlinks shouldn't rule out the possibility of random-write (not suggesting it, it was brought up earlier). There may need to be some changes to the lease manager to apply the lease to the underlying inode instead of path.
          • Hardlinks for backup aren't sufficient except by convention. That's where snapshots using hardlinks+COW blocks is interesting. COW blocks also open the door to zero-write copies.
          Lars Hofhansl added a comment -

          Hardlinks would be used for temporary snapshotting (not to hold the backup itself).

          Anyway... Since there's strong opposition to this, at Salesforce we'll either come up with something else, maintain local HDFS patches, or use a different file system.

          M. C. Srivas added a comment -

          The fact that HBase wants to use hard-links for backup does not make the backup itself safe. Backup itself only becomes safe if HDFS (not HBase) promises to never modify a file once it is closed. Otherwise, a process that accidentally writes into the hard-linked file will corrupt "both" copies. Simply having HBase say "oh, but we never modify this file via HBase" is not strong enough. The backup has to be absolutely immutable.

          So the use-case here requires a commitment from HDFS to never be able to either append or ever write into an existing file. So it means no chance of random-write or NFS support.

          Jesse Yates added a comment -

          Hardlinks are of similar nature. They are hard to support if the namespace is distributed.

          FWIW Ceph also punts on distributed hardlinks and just puts them into a single node "because they are not commonly used and not likely to be hot or large" (paraphrasing). Conceptually, you could do it with 2PC across nodes, which should be fine as long as the namespace isn't sharded too widely - 1000s+ of nodes hosting hardlink information (again, not too many hardlinks).

          From an HBase perspective, hardlink count could become large (~equal number of hfiles), but that isn't going to be near the number of files overall currently in HDFS. Maybe punt on the issue until it becomes a problem, keeping it flexible behind an interface?

          Konstantin Shvachko added a comment -

          > the key question: What services should a file system provide?

          Exactly so. I would clarify it as: What functions should be a part of the file system API and what should be a library function.

          > The same argument could be made for symbolic links. The application could implement those (in fact it's quite simple).

          "Simple" is the key point here. Simple functions should be fs APIs. Hard functions should go into libraries.

          Daryn, you are right that there is a lot of overlap, and yes, hardlinks simplify building snapshots, but you are just pushing the complexity onto the HDFS layer. This does not change the difficulty of the problem.

          We relaxed posix semantics in many aspects in HDFS for simplicity and performance. Imagine how much easier life would be with random writes or multiple writers. You are not asking for it, right?

          Hardlinks are of similar nature. They are hard to support if the namespace is distributed. They should not be HDFS API, but they could be a library function.

          Daryn Sharp added a comment -

          I see a lot of overlap between hard links and snapshots. Conceptually, a snapshot is composed of hardlinks with COW semantics for file's metadata and last partial block. Hardlinks would also be a very easy way to implement zero-write copies. Streaming the bytes down and back up via the client isn't very efficient.

          Lars Hofhansl added a comment -

          This is a good discussion.

          Couple of points:

          Or provide use cases which cannot be solved without it.

          This seems to be the key question: What services should a file system provide?
          The same argument could be made for symbolic links. The application could implement those (in fact it's quite simple).

          but they are very hard to support when the namespace is distributed

          But isn't that an implementation detail, which should not inform the feature set?
          Hardlinks could be only supported per distinct namespace (namespace in federated HDFS or a volume in MapR - I think). This is not unlike Unix where hardlinks are per distinct filesystem (i.e. not across mount points).

          @M.C. Srivas:
          If you create 15 backups without hardlinks you get 15 times the metadata and 15 times the data... Unless you assume some other feature such as snapshots with copy-on-write or backup-on-write semantics. (Maybe I did not get the argument)

          Immutable files are a very common and useful design pattern (not just for HBase), and while not strictly needed, hardlinks are very useful together with immutable files.

          Just my $0.02.

          M. C. Srivas added a comment -

          @Karthik: using hard-links for backup accomplishes exactly the opposite. The expectation with a correctly-implemented hardlink is that when the original is modified, the change is reflected in the file, no matter which path-name was used to access it. Isn't that exactly the opposite effect of what a backup/snapshot is supposed to do? Unless of course you are committing to never ever being able to modify a file once written (although that would be viewed by most as a major step backwards in the evolution of Hadoop).

          Another major problem is that the scalability of the NN gets reduced by a factor of 10 (i.e., your cluster can now hold only 10 million files instead of the 100 million which it used to be able to hold). Imagine someone doing a backup every 6 hours. Let's say the backups are to be retained as follows: 4 for the past 24 hrs, 1 daily for a week, and 1 per week for 1 month. Total: 4 + 7 + 4 = 15 backups, i.e., 15 hard-links to the files, one from each backup. So each file is pointed to by 15 names, or, in other words, the NN now holds 15 names instead of 1 for each file. I think that would reduce the number of files held by the cluster, practically speaking, by a factor of 10, no?

          Thirdly, hard-links don't work with directories. What is the scheme to back up directories? (If this scheme only usable for HBase backups and nothing else, then I agree with Konstantin that it belongs in the HBase layer and not here)

          Karthik Ranganathan added a comment -

          @Konstantin: << This can be modeled by symlinks on the application (HBase) level without making any changes in HDFS. >>
          Modeling this on top of HBase would essentially mean implementing the hardlink feature at the HBase level for all its files. This means that every application that needs a similar feature needs to use symbolic links to implement hardlinks. We have already implemented this at the underlying filesystem level for HBase backups - except that on disk/node failure, the re-replication would increase the total size of data in the cluster which was getting hard to provision. Hence the natural progression towards putting it in HDFS.

          Konstantin Shvachko added a comment -

          > I would recommend finding a different approach to implementing snapshots than adding this feature.

          I agree with Srivas, hard links seem easy in single-NameNode architecture, but they are very hard to support when the namespace is distributed, because if links to a file belong to different nodes you cannot just lock the entire namespace and do atomic cross-node linking / unlinking.
          I also agree with Srivas that hard links in traditional file systems cause more problems than add value.
          Looking at the design document I see that you create sort of internal symlinks called INodeHardLinkFile pointing to HardLinkFileInfo, representing the actual file. This can be modeled by symlinks on the application (HBase) level without making any changes in HDFS.

          I strongly discourage bringing this feature inside HDFS.
          Or provide use cases which cannot be solved without it.

          Konstantin Shvachko added a comment -

          Sorry, some combination of buttons led to reassigning. Assigning back.

          Andy Isaacson added a comment -

          When users run "cp" in the linux file system against hard linked files, it will copy the bytes, right?

          cp -a preserves hard links; cp -r breaks them (duplicates the bytes).

          I think we shall keep the same semantics here as well.

          I don't think it's a good idea to pretend that we can or should preserve every corner case of the semantics of POSIX hard links. The Unix hard link was originally a historical accident of the inode/dentry structure of the filesystem, preserved because it's useful and has been heavily relied upon by users of the Unix api. The implementation in something like ZFS or btrfs is pretty far away from the original simplicity.

          Since we don't have API compatibility with Unix and our underlying structure is deeply different, it's a good idea to borrow the good ideas but take a practical eye to where it makes sense to diverge.

          Liyin Tang added a comment -

          Good point, Lars.
          When users run "cp" in the Linux file system against hard linked files, it will copy the bytes, right?
          I think we shall keep the same semantics here as well.

          In terms of optimization, the upper-level application should know when to use a hardlink in the remote/destination DFS instead of copying the bytes between two clusters.

          Lars Hofhansl added a comment -

          Thanks Liyin. Sounds good.

          One thought that occurred to me since: we need to think about copy semantics. For example, how will distcp handle this? It shouldn't create a new copy of a file for each hardlink that points to it, but rather copy it at most once and create hardlinks for each following reference. But then what about multiple distcp commands that happen to cover hardlinks to the same file? I suppose in that case we cannot be expected to avoid multiple copies of the same file (but at most one copy for each invocation of distcp, and only if the distcp happens to cover a different hardlink).
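
          A minimal sketch of the "copy at most once, then link" idea described above, assuming the proposed FileSystem.hardLink(src, dst) API and a hypothetical getFileId() helper to detect that two paths share the same underlying file (neither exists in HDFS today):

            import java.util.HashMap;
            import java.util.Map;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.FileUtil;
            import org.apache.hadoop.fs.Path;

            /** Sketch only: one distcp-like pass that copies each hardlinked file at most once. */
            class HardLinkAwareCopy {
              static void copyAll(FileSystem srcFs, FileSystem dstFs,
                                  FileStatus[] sources, Path dstDir) throws Exception {
                Map<Long, Path> copiedByFileId = new HashMap<Long, Path>();
                for (FileStatus src : sources) {
                  Path dst = new Path(dstDir, src.getPath().getName());
                  Long fileId = getFileId(srcFs, src);           // hypothetical identity of the shared blocks
                  Path existing = copiedByFileId.get(fileId);
                  if (existing == null) {
                    // first reference: copy the bytes, as distcp does today
                    FileUtil.copy(srcFs, src.getPath(), dstFs, dst, false, srcFs.getConf());
                    copiedByFileId.put(fileId, dst);
                  } else {
                    // later references: re-create the hardlink instead of copying again
                    hardLink(dstFs, existing, dst);              // would call the proposed hardLink API
                  }
                }
              }
              private static Long getFileId(FileSystem fs, FileStatus st) { return 0L; }   // placeholder
              private static void hardLink(FileSystem fs, Path src, Path dst) { }          // placeholder
            }

          A second invocation would start with an empty map, so, as noted above, separate distcp runs covering different hardlinks to the same file could still produce multiple copies.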

          Liyin Tang added a comment -

          I plan to break this feature into several parts:
          1) Implement the new FileSystem API, hardLink, based on the INodeHardLinkFile and HardLinkFileInfo classes. Also handle deletion properly.
          2) Handle the DU operation and quota updates properly.
          3) Update the FSImage format and the FSEditLog.

          I have finished part 1 but am still working on part 2.
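
          For illustration, a sketch of how part 1 might be exercised once the new API exists; FileSystem.hardLink(src, dst), its signature, and the paths used here are assumptions based on the design doc, not shipped HDFS methods:

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            // Sketch only -- hardLink() is the proposed API, not an existing FileSystem method.
            class HardLinkExample {
              public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                Path src  = new Path("/user/hbase/table/cf/hfile-0001");   // a closed, immutable file
                Path link = new Path("/user/hbase/.snapshot/hfile-0001");

                fs.hardLink(src, link);   // NameNode-only metadata operation; no block data is copied
                fs.delete(src, false);    // deleting one link leaves the shared blocks reachable via the other
              }
            }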

          Lars Hofhansl added a comment -

          Do you have a preliminary patch to look at?

          Liyin Tang added a comment -

          Hi Lars, we are still working on this feature. It may take a while to take care of all the cases, especially the quota updates and the fsimage format change.

          Lars Hofhansl added a comment -

          Is anybody working on a patch for this?
          If not, I would not mind picking this up (although I can't promise getting to this before the end of the month).

          Lars Hofhansl added a comment -

          Reading through the design doc, it seems that FileSystem.{setPermission|setOwner} would be awkward. We'd have to find each INodeHardLinkFile pointing to the same "file" and then change all their permissions/owners.

          HardLinkFileInfo could also maintain permissions and owners (since they - following POSIX - are the same for each hard link). That way changing the owner or permissions would immediately affect all hard links.
          When the fsimage is saved, each INodeHardLinkFile would still write its own permission and owner (for simplicity, but that could be optimized, as long as at least one INode writes the permissions/owner).
          Upon read, every INode representing a hardlink must have the same permission/owner as all other INodes linking to the same "file"; if not, the image is inconsistent.

          In that case HardLinkFileInfo would not need to maintain a list of pointers back to all INodeHardLinkFiles, and the owner/permissions would only be stored once in memory.
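
          A minimal sketch of the alternative described above; the class names come from the design doc, but the fields and methods here are illustrative assumptions rather than the actual patch:

            // Illustrative only: shared per-file attributes live in one HardLinkFileInfo object.
            class HardLinkFileInfo {
              private String owner;          // same for every link, following POSIX
              private String group;
              private short permission;

              synchronized void setOwner(String owner, String group) {
                this.owner = owner;          // one update is immediately visible through every link
                this.group = group;
              }
              synchronized void setPermission(short permission) {
                this.permission = permission;
              }
              synchronized short getPermission() { return permission; }
            }

            // Each link delegates, so setPermission/setOwner never have to walk all links.
            class INodeHardLinkFile {
              private final HardLinkFileInfo shared;
              INodeHardLinkFile(HardLinkFileInfo shared) { this.shared = shared; }
              void setPermission(short p)       { shared.setPermission(p); }
              void setOwner(String o, String g) { shared.setOwner(o, g); }
              short getPermission()             { return shared.getPermission(); }
            }

          With this shape, HardLinkFileInfo needs no back-pointers to the links for attribute changes, which is the memory saving mentioned above.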

          Lars Hofhansl added a comment -

          @M.C. Srivas: Isn't that the same for any file?
          A rename of a file renames one of its references. I don't understand how the fact that the file has more references has any impact on that.

          Hardlinks are incredibly useful for applications like HBase, where an immutable HFile could just be mapped to another directory (not just for backup purposes).

          M. C. Srivas added a comment -

          Sanjay, POSIX says that a user cannot open a file unless they have permission to traverse the entire path from / to the file. The problem is that if a file has two paths (as in a hard-link), permissions become very hard to enforce, since a file does not know which dir is its parent. Imagine a rename of a file with many hard-links into a new dir. This problem is harder in a distributed file system if you wish to spread the metadata. Note that the enforcement happens automatically with symbolic links. As you point out, with MapR we could implement hard-links within a volume, but chose not to and instead implemented only symlinks. (I personally find symlinks to be more flexible.)

          Sanjay Radia added a comment -
          • I see the additional complexity in quotas because HDFS quotas are directory based (as in several file systems). I think this is addressable if we double count the quotas along both paths.
          • Permissions are not a problem since the file retains the file permissions and both paths to the file offer their own permissions.
          • I don't understand Srivas's rename example.
            Srivas, in MapR I suspect that renames are only allowed within a volume and such hard links would be supported only within a volume. Can you explain the problem in some more detail?
          Hari Mankude added a comment -

          Can the hard linked files be reopened for append?

          Liyin Tang added a comment -

          It only allows hard links to closed files.

          Sanjay Radia added a comment -

          Is the proposal to allow hard links to only files or to files and directories?

          Liyin Tang added a comment -

          @M.C. Srivas, I am afraid that I didn't quite understand your concerns.

          > 1. Changing the permissions along the path "/path1/dirA" to make "file" inaccessible works, but now with hard-links "/path2/dirB" is wide open.
          > 2. Rename "/path2/dirB" to "/path3/dirC" will require taking locks on "/path1/dirA" ... but the "file" does not have "parent ptrs" to figure out which path(s) to lock.

          1) Hardlinked files are supposed to have the same permissions as the source file.
          2) Each INodeFile does have a parent pointer to its parent in HDFS; also, which lock are you talking about exactly (from the implementation's perspective)?

          @Daryn Sharp:

          > I did a little testing and didn't realize that a hard link retains the same attrs (owner, group, perms, xattrs, etc) as the original file. Changing one implicitly changes the others, so that negates some issues such as differing replication factor concerns. Perhaps hard link creation can be restricted to only the file owner and superuser.

          Totally agreed.

          > The quota concerns are still a bit more complex. Unixy systems like Linux and BSD only have fs-level quotas for users, so quota handling is trivial compared to directory-level quotas in HDFS. Since all hard links implicitly have the same owner, quotas are as simple as incrementing the user's ds quota at file creation, and decrementing when all links are removed. This is why a DoS is possible against a user.

          From the security perspective, users should be responsible for setting the correct permissions to protect themselves. In this case, users should ONLY grant the EXECUTE permission to trusted users for hardlinking.

          > I'm sorry if I'm missing a detail, but I remain unclear on how you are proposing to handle the directory level quotas. I don't fully grok how finding a common ancestor with a quota is sufficient because quotas can be added or removed at any time. Maybe part of the issue too is that I have nested directories with individual quotas in mind, whereas maybe you are assuming one and only one quota from the root?

          Would you mind giving me an exact example of your concerns about quotas? I would be very happy to explain it in detail.

          Daryn Sharp added a comment -

          I fully agree that POSIX and/or Linux conventions should ideally be followed.

          I did a little testing and didn't realize that a hard link retains the same attrs (owner, group, perms, xattrs, etc) as the original file. Changing one implicitly changes the others, so that negates some issues such as differing replication factor concerns. Perhaps hard link creation can be restricted to only the file owner and superuser.

          The quota concerns are still a bit more complex. Unixy systems like Linux and BSD only have fs-level quotas for users, so quota handling is trivial compared to directory-level quotas in HDFS. Since all hard links implicitly have the same owner, quotas are as simple as incrementing the user's ds quota at file creation, and decrementing when all links are removed. This is why a DoS is possible against a user.

          I'm sorry if I'm missing a detail, but I remain unclear on how you are proposing to handle the directory level quotas. I don't fully grok how finding a common ancestor with a quota is sufficient because quotas can be added or removed at any time. Maybe part of the issue too is that I have nested directories with individual quotas in mind, whereas maybe you are assuming one and only one quota from the root?

          I look forward to your thoughts.

          M. C. Srivas added a comment -

          Creating hard-links in a distributed file system will cause all kinds of future problems with scalability. Hard-links are rarely used in the real world because of all the associated bizarre problems. E.g., consider a hardlink setup as follows:

          link1: /path1/dirA/file
          link2: /path2/dirB/file

          1. Changing the permissions along the path "/path1/dirA" to make "file" inaccessible works, but now with hard-links "/path2/dirB" is wide open.

          2. Rename "/path2/dirB" to "/path3/dirC" will require taking locks on "/path1/dirA" ... but the "file" does not have "parent ptrs" to figure out which path(s) to lock.

          I would recommend finding a different approach to implementing snapshots than adding this feature.

          Liyin Tang added a comment -

          > Another consideration is that ds quota is based on a multiple of the replication factor, so who is allowed to change the replication factor, since increasing it may impact a different user's quota?

          Generally, when a user creates a hardlink in Linux, it requires the EXECUTE permission on the source directory and the WRITE and EXECUTE permissions on the destination directory. And it is a well-known issue that hard links on Linux can create local DoS vulnerabilities and security problems, especially when a malicious user keeps creating hard links to other users' files and makes them run out of quota. One of the solutions to this problem is to set the permissions of the directory correctly.

          HDFS hardlinks should follow the same permission requirements as a general Linux FS and only allow trusted users or groups with the right permissions to create hardlinks. The same security principle shall apply to the setReplication operation, which can be treated as a normal write operation in a general Linux FS.

          Thanks so much, Daryn Sharp, for the above discussion.
          It really helps us re-visit several design issues and improve the solutions. I will update the design doc later.
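
          A sketch of the directory-permission check this implies on the NameNode side; checkAccess() here is a stand-in for HDFS's internal permission checker, and the whole class is an illustrative assumption rather than existing code:

            import org.apache.hadoop.fs.permission.FsAction;

            // Sketch only: enforce the Linux-style rules described above before creating a hardlink.
            class HardLinkPermissionCheck {
              void checkHardLink(String user, String[] groups, String srcDir, String dstDir) {
                // traverse permission on the source directory
                checkAccess(user, groups, srcDir, FsAction.EXECUTE);
                // write + traverse permission on the destination directory
                checkAccess(user, groups, dstDir, FsAction.WRITE_EXECUTE);
              }
              private void checkAccess(String user, String[] groups, String dir, FsAction action) {
                // placeholder for FSPermissionChecker-style logic; would throw on failure
              }
            }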

          Liyin Tang added a comment -

          > I agree that ds quota doesn't need to be changed when there are links in the same directory. I'm referring to the case of hardlinks across directories. Ie. /dir/dir2/file and /dir/dir3/hardlink. If dir2 and dir3 have separate ds quotas, then dir3 has to absorb the ds quota when the original file is removed from dir2. What if there is a /dir/dir4/hardlink2? Does dir3 or dir4 absorb the ds quota? What if neither has the necessary quota available?

          Based on the same example you commented on: when linking /dir/dir2/file and /dir/dir3/hardlink, it will increase the dsquota usage for dir3 but not for /dir, because dir3 is NOT a common ancestor but /dir is. And if dir3 doesn't have enough dsquota, it shall throw a quota exception. Also, if there is a /dir/dir4/hardlink2, it absorbs the dsquota for dir4 as well. So the point is that it only absorbs the dsquota at link creation time and decreases the dsquota at link deletion time.

          From my understanding, the basic semantics of HardLink are to allow a user to create multiple logical files referencing the same set of blocks/bytes on disk. So a user could set different file-level attributes for each linked file, such as owner, permission, and modification time.
          Since these linked files share the same set of blocks, the block-level settings shall be shared.
          It may be a little confusing to decide whether the replication factor in HDFS is a file-level attribute or a block-level attribute.
          If we agree that the replication factor is a block-level attribute, then we shall pay the overhead (wait time) when increasing the replication factor, just as when increasing the replication factor of a regular file, and the setReplication operation is supposed to fail if it breaks the dsquota.
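
          A sketch of the "charge only below the common ancestor" rule in the example above, using org.apache.hadoop.fs.Path only for brevity; the real accounting would walk INode parents, and commonAncestor()/addDiskSpace() are hypothetical helpers:

            import org.apache.hadoop.fs.Path;

            // Sketch only: dsquota is charged to the new link's ancestors up to (but excluding)
            // the lowest common ancestor with the source, e.g. /dir for /dir/dir2/file -> /dir/dir3/hardlink.
            class HardLinkQuotaSketch {
              void chargeOnLinkCreate(Path src, Path newLink, long diskSpace) {
                Path stop = commonAncestor(src, newLink);          // /dir in the example above
                for (Path dir = newLink.getParent();
                     dir != null && !dir.equals(stop);
                     dir = dir.getParent()) {
                  addDiskSpace(dir, diskSpace);   // may throw a quota-exceeded exception, as described above
                }
                // Directories at or above the common ancestor already account for the blocks once.
                // Deleting a link reverses the same walk; removing the last link releases everything.
              }
              private Path commonAncestor(Path a, Path b) { return a.getParent(); }   // placeholder
              private void addDiskSpace(Path dir, long bytes) { }                      // placeholder
            }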

          Daryn Sharp added a comment -

          I'm glad you find my questions helpful!

          > For example, "ln /root/dir1/file1 /root/dir1/file2": there is no need to increase the ds quota usage when creating the link file, file2. Also "rm /root/dir1/file1": there is no need to decrease the ds quota usage when removing the original source file, file1.

          I agree that ds quota doesn't need to be changed when there are links in the same directory. I'm referring to the case of hardlinks across directories. Ie. /dir/dir2/file and /dir/dir3/hardlink. If dir2 and dir3 have separate ds quotas, then dir3 has to absorb the ds quota when the original file is removed from dir2. What if there is a /dir/dir4/hardlink2? Does dir3 or dir4 absorb the ds quota? What if neither has the necessary quota available?

          > Currently, at least for V1, we shall support hardlinking only for closed files and won't support the append operation against linked files, but it could be extended in the future.

          A reasonable approach, but it may lead to user confusion. It almost begs for an immutable flag (ie. chattr +i/-i) to prevent inadvertent hard linking to files intended to be mutable.

          Nonetheless, I'd suggest exploring the difficulties of reconciling the current design of the namesystem/block management with your design. It may help avoid boxing ourselves into a corner with limited hard link support.

          > From my understanding, the setReplication is just a memory footprint update and the name node will increase the actual replication in the background.

          Yes, but the FsShell setrep command actively monitors the files and does not exit until the replication factor is what the user requested – as determined by the number of hosts per block. Another consideration is that ds quota is based on a multiple of the replication factor, so who is allowed to change the replication factor, since increasing it may impact a different user's quota?

          Liyin Tang added a comment -

          @Daryn Sharp: very good comments.
          1) Quota is the trickiest part of hard links.

          For nsquota usage, it will be added when creating hardlinks and decreased when removing hardlinks.

          For dsquota usage, it will only increase and decrease the quota usage for the directories that are not common ancestor directories with any linked file.
          For example, "ln /root/dir1/file1 /root/dir1/file2": there is no need to increase the ds quota usage when creating the link file, file2.
          Also "rm /root/dir1/file1": there is no need to decrease the ds quota usage when removing the original source file, file1.

          The bottom line is that there is no case in which we need to increase any dsquota during a file removal operation. If the directory is a common ancestor directory, no dsquota needs to be updated; otherwise, the dsquota has already been updated at hard link creation time.

          2) You are right that each blockInfo of the linked files needs to be updated when the original file is deleted. I shall update the design doc to explicitly explain this part in detail.

          3) Currently, at least for V1, we shall support hardlinking only for closed files and won't support the append operation against linked files, but it could be extended in the future.

          4) Very good point that hardlinked files shall respect the max replication factors. From my understanding, the setReplication is just a memory footprint update and the name node will increase the actual replication in the background.

          John George added a comment -

          Thanks for uploading the design document.
          Do you plan to support hardlink using FileContext? In the design document, I see FileSystem and FsShell being mentioned as client interface - hence the question.

          Daryn Sharp added a comment -

          While I really like the idea of hardlinks, I believe there are more non-trivial considerations with this proposed implementation. I'm by no means an SME, but I experimented with a very different approach a while ago. Here are some of the issues I encountered:

          I think the quota considerations may be a bit trickier. The original creator of the file takes the nsquota & dsquota hit. The links take just the dsquota hit. However, when the original creator of the file is removed, one of the other links must absorb the dsquota. If there are multiple remaining links, which one takes the hit?

          What if none of the remaining links have available quota? If the dsquota can always be exceeded, I can bypass my quota by creating the file in one dir, hardlinking from my out-of-dsquota dir, then removing the original. If the dsquota cannot be exceeded, I can (maliciously?) hardlink from my out-of-dsquota dir to deny the original creator the ability to delete the file – perhaps causing them to be unable to reduce their quota usage.

          Block management will also be impacted. The manager currently operates on an inode mapping (changing to an interface, though), but which of the hardlink inodes will it be? The original? When that link is removed, how will the block manager be updated with another hardlink inode?

          When a file is open for writing, the inode converts to under-construction, so there would need to be a hardlink-under-construction. You will have to think about how other hardlinks are affected/handled. The same applies to hardlinks during file creation and appending.

          There may also be an impact on file leases. I believe they are path based, so leases will now need to be enforced across multiple paths.

          What if one hardlink changes the replication factor? The maximum replication factor of all hardlinks should probably be obeyed, but then the setrep command will never succeed since it waits for the replication value to actually change.

          Hairong Kuang added a comment -

          Sanjay, you are right. HDFS hardlink is only a metadata operation and no datanode is involved. In all our use cases, the source file may be deleted over time but its content can still be accessed through hardlinks.

          Liyin Tang added a comment -

          @Sanjay, sorry that I misunderstood "the advantage over".
          It is correct that keeping the other linked files alive after a deletion is the main advantage over symbolic links.

          Liyin Tang added a comment -

          > The main advantage over symbolic links being that when the original link is deleted the 2nd one keeps the actual data from being deleted. Correct?

          Do you mean hard links instead of symbolic links? If the original link is deleted, a symbolic link will be broken. But if one of the hard linked files is deleted, the other linked files won't be affected.

          You are right that the hard link stays on the NN only.

          Sanjay Radia added a comment -

          The main advantage over symbolic links is that when the original link is deleted, the 2nd one keeps the actual data from being deleted. Correct?
          Does the hard link stay on the NN, or does it propagate to the actual blocks on the DNs?
          I believe it is not necessary to propagate the link to the DNs based on the use cases you have described.

          Liyin Tang added a comment -

          Attached the HDFS-HardLinks design doc; any comments are very welcome.

          Namit Jain added a comment -

          Another use case in Hive is to copy one table/partition to another table/partition.
          Ideally, we would like the following in Hive:

          Copy Table T1 to T2.

          The files under the table location for T2 (say, /user/hive/warehouse/T2/0) can be a link to the corresponding file in table T1
          (say, /user/hive/warehouse/T1/0).

          Having said that, one of the requirements is that the data should be modifiable independently. So, if new data is loaded into T1 (or T2),
          those changes should not be visible to T2 (or T1, respectively).
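
          A sketch of what this Hive case could look like with the proposed API; hardLink() is the assumed new FileSystem call (not in HDFS today), and the warehouse paths are simply the ones from the example above:

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            // Sketch only: "copy" T1 into T2 by hardlinking each data file instead of copying bytes.
            class CopyTableByHardLink {
              public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                Path t1 = new Path("/user/hive/warehouse/T1");
                Path t2 = new Path("/user/hive/warehouse/T2");
                fs.mkdirs(t2);
                for (FileStatus f : fs.listStatus(t1)) {
                  fs.hardLink(f.getPath(), new Path(t2, f.getPath().getName()));   // proposed API
                }
                // New files later loaded into T1 or T2 land only in that table's directory,
                // so the two tables evolve independently, as the requirement above demands.
              }
            }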


            People

            • Assignee: Liyin Tang
            • Reporter: Hairong Kuang
            • Votes: 3
            • Watchers: 56
