Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Support HDFS snapshots. It should be possible to create snapshots without shutting down the file system. Snapshot creation should be lightweight, and a typical system should be able to support a few thousand concurrent snapshots. There should be a way to surface (i.e. mount) a few of these snapshots simultaneously.
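
      For a rough sense of the client-facing surface this request implies, below is a minimal sketch using the snapshot API that eventually shipped with HDFS-2802 (referenced at the end of this thread). The path and snapshot name are made up, and the sketch does not reflect the design in the attached documents.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class SnapshotExample {
        public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());

          // Take a lightweight, online snapshot of a directory. In the feature that
          // eventually shipped (HDFS-2802), the directory must first be marked
          // snapshottable by an administrator ("hdfs dfsadmin -allowSnapshot /warehouse").
          Path warehouse = new Path("/warehouse");  // hypothetical path
          Path snap = fs.createSnapshot(warehouse, "daily-0701");

          // The snapshot surfaces as a read-only subtree under ".snapshot", so it can
          // be read with ordinary FileSystem calls, without mounting anything extra.
          System.out.println("Snapshot readable at: " + snap);
          fs.listStatus(new Path(warehouse, ".snapshot/daily-0701"));

          fs.close();
        }
      }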

      1. Snapshots.pdf
        94 kB
        dhruba borthakur
      2. Snapshots.pdf
        94 kB
        dhruba borthakur

        Issue Links

          Activity

          dhruba borthakur added a comment -

          A few Primary requirements:

          1. The latency of creating a snapshot should be small. This is important for a data-warehousing installation where decision-analysis software would like to create frequent snapshots of production data.
          2. A system should be able to support a few thousand snapshots concurrently.
          3. Existence of a snapshot should not unduly tax the memory requirements of the namenode.
          4. Only a few of those snapshots will be mounted simultaneously.
          5. A write to one snapshot should not affect data in another snapshot.
          6. The guarantees are such that files/directories that were created before the snapshot was created will be part of the snapshot. Files that were created after the snapshot was taken will not be part of the snapshot.
          7. Blocks that were allocated before the snapshot was created will be part of the snapshot. No cluster-wide freeze-thaw mechanism is needed for data blocks. A client can continue to write data to a block that was allocated to a snapshot without needing to trigger copy-on-write.

          Secondary Requirements:
          1. It would be good to have snapshotting support for a subtree rather than the entire namespace.
          2. It would be good to have some sort of "search feature" for files in snapshots without mounting the snapshot. This will help in finding out whether a file/dir exists in a snapshot before mounting it.

          Joydeep Sen Sarma added a comment -

          Why are thousands of snapshots required? Just curious - most commercial products offer a couple of hundred at most, and I have never heard anyone complain.

          One of my biggest concerns is irrecoverable corruption of the on-disk namenode metadata and the ability to restore to a known point in time from a backup copy of the metadata. Typically the backup image would be obtained by external software (extraneous to HDFS). It would be good to articulate a recovery procedure for this (and make this a core requirement). (It is understood that the backup should be more recent than the oldest snapshot for there to be at least one available recovery point.)

          dhruba borthakur added a comment -

          My understanding is that a data warehousing platform needs a huge number of snapshots. It is typical to create a snapshot of production data and then run decision-analysis software on the snapshot. It has to write summary reports to the snapshots (hence the writable snapshots).

          Creating a snapshot should create a copy of the fsimage and transaction log. For disaster recovery solutions, we would have to transfer these images/edits to a remote location using some external software. I will add this to the requirements. Requirement 8 is as follows:

          8. Support using snapshots for disaster recovery. Provide an example of transferring the fsimage/edits files for each snapshot to some remote location using an external piece of software (e.g. ftp, scp, etc). Provide documentation for recovering HDFS from one of these snapshots.

          Lohit Vijayarenu added a comment -

          Does this mean we can get rid of the SecondaryNameNode and use it only for failovers?

          dhruba borthakur added a comment -

          Good question. I think a typical installation would have the namenode and secondary namenode in the same data center. So, in a disaster scenario, both of these machines might be unavailable. In short, a disaster recovery solution would need something more than the secondary namenode.

          I think that the request from Joydeep is to transfer the snapshot image to a remote location (possibly outside the data center) so that it can allow namenode recovery after a disaster. One assumption for this recovery is that at least one replica of most HDFS blocks is not affected by the disaster.

          Hong Tang added a comment -

          Requirement 1. Instead of saying "latency of creating a snapshot be small", can we be a bit more concrete, e.g. "snapshot creation is a local operation on the NN, and the latency should be on the order of seconds"?

          Joydeep Sen Sarma added a comment -

          > I think that the request from Joydeep is to transfer the snapshot image to a remote location (possibly outside the data center) so
          > that it can allow namenode recovery after a disaster.

          More or less - my definition of disaster was simply corruption of the namenode metadata. But yes - if replication is managed in a way that preserves enough copies of the data blocks at a different site, then I guess this would allow site disaster recovery as well.

          Joydeep Sen Sarma added a comment -

          The way snapshots are used in an OLTP database to facilitate warehousing applications is not applicable to Hadoop, by the way. Hadoop is pretty much not used for OLTP-style applications and is a data warehouse to begin with (or so it would seem to me). I would like snapshots for recovery - not for facilitating any data warehousing functionality.

          The whole writable-snapshot scenario looks less than applicable to Hadoop, if you ask me.

          Now, if HBase or something similar really took off and became the primary way of storing data in Hadoop - then that would be a whole different story.

          dhruba borthakur added a comment -

          Hong: I agree with your refinement of Requirement 1. For scalability reasons, it might be worthwhile to require that the namenode should not have to contact all datanodes at the time of snapshot creation.

          Joydeep: I do not plan on implementing a writable snapshot at present. However, if the requirement/design is such that it will encompass writable snapshots in the future, that would be good, wouldn't it? I think of Hadoop more like a data warehouse and archive packaged into one. When an administrator/user creates a snapshot, it would be nice to be able to at least annotate that snapshot with some more metadata. The annotation can be done by writing an arbitrary-size label into a new file in the snapshot, i.e. the ability to write to the snapshot!
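
          A minimal sketch of what such an annotation could look like from a client, assuming a writable snapshot were surfaced at some mount path (the path, file name, and label text below are all hypothetical; nothing like this exists yet):

            import java.nio.charset.StandardCharsets;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FSDataOutputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class AnnotateSnapshot {
              public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());

                // Hypothetical mount point of a writable snapshot; no such path exists today.
                Path label = new Path("/snapshots/2008-07-01/.label");

                // The annotation is just an arbitrary-size label written into a new file
                // inside the snapshot, i.e. a (small) write to the snapshot itself.
                try (FSDataOutputStream out = fs.create(label)) {
                  out.write("nightly warehouse snapshot, pre-ETL run".getBytes(StandardCharsets.UTF_8));
                }
                fs.close();
              }
            }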

          dhruba borthakur added a comment -

          This is a very early draft of the design I am thinking of. Early comments are welcome. Some portions of the document might not yet be very readable; I am working on a few pictures to enhance readability.

          Allen Wittenauer added a comment -

          Overall, what has been proposed sounds very promising to me. Some comments though:

          Requirements
          ===========

          Requirement #4 vs. Non-goal #2: "Only a few of those snapshots will be accessed simultaneously"

          What happens if it turns out that finding the data one is looking for is expensive? In other words, what is Plan B?

          On busy systems, I can see where many users could be searching for data in different snapshots very, very easily, esp. when FUSE is involved.

          Requirements #5:

          While I understand what you are saying and why the requirement exists, it would be good to make sure this is really well documented.

          [An aside... I never thought of directed graphs as being a mathematical construct. At least I was taught them as part of my CS courses which were distinct from the math courses. Hmm.]

          "Special Number 500"
          =================

          Why 500? That seems particularly arbitrary. I would recommend starting at a digit boundary. 1000, 10000, 100, whatever.

          Namedir Structure
          ==============

          What happens when the number of snapshots gets large? Any concern about things like directory name lookup caches at the (UNIX) file system level having issues? Would it be a good idea to be able to support a multilevel hashed structure now or wait till someone needs it?

          Appending to Files
          ==============
          I have a bit of concern about the "wait for some period" bit. We've noticed that when the file system gets full at the UNIX level, the name node goes a bit spastic while it tries to hunt for free space. Now, clearly the name node should be better behaved in this sort of edge-case scenario. But I'm wondering what the client should do when it retries after waiting for the NN to COW the block under such conditions.

          Joydeep Sen Sarma added a comment -

          A couple of comments:

          • We should make it a requirement that admins be able to find out how much space is locked down (exclusively) by a snapshot. This is very valuable, because otherwise there is no way to know which snapshot should be deleted when running out of space.
          • It may make sense to exclude certain path prefixes from snapshotting. This could be a global configuration - and from the current design standpoint it looks really easy to do (if a block belongs to one of the excluded path prefixes, then the block can be freed unless it appears in the current namespace). This would be great for excluding tmp folders that have high churn and would otherwise cause space to be locked up unnecessarily. (See the sketch below.)
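
          A hedged sketch of the global exclude-list idea above: the property name "dfs.snapshot.exclude.prefixes" is invented for illustration (no such key exists in HDFS), and only the path-prefix test is shown, not the block-freeing logic.

            import java.util.Arrays;
            import org.apache.hadoop.conf.Configuration;

            public class SnapshotExcludeList {
              // Hypothetical property listing path prefixes excluded from snapshots.
              static final String EXCLUDE_KEY = "dfs.snapshot.exclude.prefixes";

              // Returns true if the given path falls under an excluded prefix, e.g. a
              // high-churn tmp folder whose blocks should not be pinned by snapshots.
              public static boolean isExcluded(Configuration conf, String path) {
                String[] prefixes = conf.getStrings(EXCLUDE_KEY, "/tmp");
                return Arrays.stream(prefixes).anyMatch(path::startsWith);
              }

              public static void main(String[] args) {
                Configuration conf = new Configuration();
                System.out.println(isExcluded(conf, "/tmp/job_0001/part-00000"));  // true with the "/tmp" default
                System.out.println(isExcluded(conf, "/warehouse/orders"));         // false
              }
            }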
          dhruba borthakur added a comment -

          New version of the document. This accounts for handling snapshots that are not mounted.

          Sanjay Radia added a comment - edited

          My comment is regarding the motivation for snapshots as described here.
          The introduction of the attached file states that the motivation is time travel to previous states of the file system.
          Clearly HDFS can store data over time and store an evolving state of some data in different files with names and timestamps. Is this not sufficient? Or are you really looking for a way to view the complete file system at a particular point in time?
          Snapshots in other systems are motivated by recovering accidental deletions or changes.

          Also, this design proposes snapshotting the entire file system rather than some subtree.
          I believe that for backup purposes one wants to snapshot different parts of the file system at different degrees of granularity (rather than the file system as a whole). Hence I disagree with the non-goal of snapshotting subtrees. At the very least, the design should be general enough to allow snapshotting subtrees, even though version 1 will implement snapshotting of the entire file system.

          Pete Wyckoff added a comment -

          +1 to disaster recovery.

          Another requirement might be that mounting a snapshot should be part of the dfsshell, not require the user to launch new servers or anything else, and that the mounting be reasonably fast.

          Pete Wyckoff added a comment -

          And it should be possible to enumerate snapshots and mount them from libhdfs.

          Sanjay Radia added a comment -

          Pete says:
          >Another requirement might be that the mounting of a snapshot should be part of the dfsshell and not require
          >the user to launch new servers or something else and that the mounting be reasonably fast.

          You mean the snapshot is mounted on the client side (i.e. a mini namenode on the client side)? Or do you mean the main name node can mount a snapshot?
          For large file systems, the main name node will not be able to fit both a snapshot and the current view of the file system in memory.

          Allen Wittenauer added a comment -

          A few folks have asked me to clarify my position on what I really want to be able to do with snapshots.

          A solution that provides a single snap of the entire file system is acceptable for the short term. For the medium term, the solution should provide multiple snaps of the entire file system and have them all usable simultaneously. But ultimately, I want to be able to snapshot directories (partial file systems) multiple times a day and have all of those snapshots available [a la ZFS's .zfs and NetApp's .snapshot system]. Different file paths have different priorities with regard to the desired level of recoverability.

          I think Sanjay is correct in that snapshots will get used more for the "oops!" situations than a full blown recovery. As an end user of other snapshot systems, it is extremely nice being able to compare/contrast multiple versions of the same file. This would be especially prevalent in the FUSE and NFS proxy cases, I'd imagine, where the fs is running around in a POSIX-like costume.

          I don't want to understate the need to be able to do a full file system recovery though. Having to bring the system down just to take a snapshot is annoying.

          I want to be able to do both as they provide different solutions for different needs. Any solution that prevents fulfilling what I need over the long haul, to me, is probably not a good idea. I haven't dug into the spec to see if that is the situation here.

          plink plink

          Pete Wyckoff added a comment -

          > You mean the snapshot is mounted on the client side (i.e. a mini namenode on the client side)? Or do you mean the main name node can mount a snapshot?

          I didn't mean on the client side, or that the namenode has to do it; just that, if this is supported and installed, there should probably be a daemon to handle remote mounting requests, or the namenode should be capable of launching one on some machine.

          I just meant that from the libhdfs world on some remote machine it will not be possible to launch a namenode, so the DFSClient or something similar has to be able to call some remote server to do the mount for us.

          +1 to Allen's and Sanjay's comments.

          Disaster recovery snapshots really go a long way towards making HDFS production-worthy for many apps that don't require super high availability but that cannot lose data because of system (HDFS) or operator errors.

          Doug Cutting added a comment -

          > For large file systems, the main name node will not be able to fit both a snapshot and the current view of the file system in memory.

          Perhaps snapshots should never be held in memory, but rather dealt with as an on-disk data structure? Thus it would be possible to browse snapshots, but not very quickly. Would that be acceptable?

          dhruba borthakur added a comment -

          I think blocks that belong to snapshots but not to the primary namespace still need to be replicated. Similarly, rebalancing, decommissioning, etc. should work correctly on these blocks. That is the reason why my design has a SnapNode that maintains the blocks of snapshots that are not mounted read-write. For snapshots that are mounted read-write, it is the ViewNode that serves the namespace as well as the blocks behind it.

          Regarding Sanjay's comments: the ability to 'time-travel' is not a motivation for this design. The primary and only motivation is disaster recovery. However, I believe that this design is good for scaling to a very large HDFS cluster. Please let me know if you see the need for any major changes in the proposed design.

          Regarding Allen's comments: this design provides for full-file-system snapshot and recovery. Snapshots of specified directories are not allowed. However, one can specify a set of directories to exclude from a snapshot (to eliminate tmp directories). In the long term, we can enhance this feature to support per-directory snapshots.

          dhruba borthakur added a comment -

          Just to clarify Doug's comments:

          > Perhaps snapshots should never be held in memory, but rather dealt with as an on-disk data structure? Thus it would be possible to browse snapshots, but not very quickly.

          We can come up with a plan to read snapshots directly from the disk image. However, the SnapNode still has to exist because it has to replicate blocks in the snapshot.

          Once we have the ability to serve snapshot data directly from the fsimage on disk, the same technique could be used to serve data from the primary namespace! But I plan to do that as a separate JIRA, not as part of snapshots. Does that make sense?

          Hairong Kuang added a comment -

          Dhruba, in addition to the block replication complexity associated with snapshots that we discussed yesterday, rebalancing a dfs cluster introduces another complexity. A balancer needs to move active blocks as well as inactive blocks. A request for a datanode's partial block map needs to be forwarded to the SnapNode as well.

          dhruba borthakur added a comment -

          Hi Hairong, thanks. My current proposal would require that the balancer contact the SnapNode (in addition to the NameNode) to retrieve the list of candidate blocks. Do you think that will work?

          dhruba borthakur added a comment -

          Here are the suggested modifications to my current proposal:

          1. There should be an ability to specify the exclude list at snapshot creation time. It is not enough to have a cluster-wide exclude list that applies to all snapshot creations.
          2. The SnapNode could merge the fsimage, fsedits and exclude list to create a single fsimage (optimization).
          3. When a block is managed by the NameNode as well as the SnapNode, who manages replication for this block? The proposal is that the SnapNode forward all replication requests to the NameNode. The NameNode can look into its blockmap to decide whether this block really needs replication. This simplifies the architecture because it is always the NameNode that is the final arbiter of what to replicate. Similarly, if the SnapNode detects excess replicas for a block, it forwards the delete-excess-replica request to the NameNode. The NameNode, based on its own blockmap, decides whether to delete the excess replica or not.
          4. This design has no impact on namespace quotas or disk quotas. Snapshots are created by the system administrator and have no impact on per-directory disk quotas.

          Jeff Hammerbacher added a comment -

          Hey dhruba,

          Any news on the snapshot implementation?

          Thanks,
          Jeff

          dhruba borthakur added a comment -

          After a few rounds of discussions, it was found that snapshots can be designed very elegantly only if there is complete separation between namespace management and block management. Currently, both of these functions are managed by the namenode. HADOOP-5015 is a step in this direction.

          James Kennedy added a comment -

          I'd like to see this task get more traction. A snapshot mechanism would be very valuable for those of us relying on HDFS to hold critical information.

          dhruba borthakur added a comment -

          There hasn't been much work in this direction. It was deemed complex to implement because it needs lots of changes to the current NameNode/DataNode code. I have a proposal in mind that can implement "HDFS snapshots" as a layer on top of the current HDFS code with negligible changes to the existing NameNode/DataNode architecture.

          If you have any ideas regarding this, or are willing to contribute towards this effort, that would be great! Thanks.

          Sameer Agarwal added a comment -

          We recently submitted a paper with Dhruba based on an implementation of a possible snapshot solution for HDFS. While the paper broadly tries to address both truncates and appends, it would be really nice to get some feedback and advice on some design decisions, and I would love to submit an HDFS patch for the same. The paper can be accessed at http://www.cs.berkeley.edu/~sameerag/hdfs-snapshots.pdf. Thanks!

          Matt Foley added a comment -

          Dhruba and Sameer, the paper is pretty interesting. Any interest in creating a sandbox branch for this feature, and contributing the implementation for review?

          dhruba borthakur added a comment -

          Hi Matt, without design approval, the work to merge this feature into Apache HDFS trunk might go to waste. We currently have the code on GitHub; a major chunk of it was contributed by an intern, Usman Masoud.

          https://github.com/facebook/hadoop-20/tree/master/src/contrib/snapshot/

          Joe Kraska added a comment -

          Reviewing the comments and noting the data warehousing feature requests and the like, I thought I would comment on the snapshot feature from the more pragmatic perspective of simple, responsible data stewardship.

          By and large, the most important features of snapshots are being able to:
          1. Take them live.
          2. Take them economically: a snapshot should not require a particularly large amount of space.
          3. Keep a dozen or so (and often fewer).
          4. Schedule them (hourly, daily, weekly, with emphasis on the latter two).
          5. Selectively restore portions of the tree after user- or program-caused erasure or damage.
          6. Quickly restore either a sub-portion of the tree or an entire volume.

          The features above are about fundamental data protection, cost, and restore-time objectives. They are directly related to economical data stewardship and are considered the first line of defense for data protection in many enterprises today. I.e., we data stewards prefer these features over tape restores (although we also use tape, we hate it).

          AFTER the above, space-efficient writable snapshots are interesting. This is because there are test applications for current data sets where touching the master data set is a complete no-no, but the application needs to make trial changes. These snapshots are often made, modified for a while, and then deleted.

          You will want minimal performance impact for these snapshots, because the assumption should be that the scheduled snapshot system is ALWAYS in use. The one exception is static read-only data, where a single manual snapshot is recorded just once. Everything else will have something like 2 daily and 2 weekly snapshots going all the time. Some enterprises will also use snapshots scheduled every 6 hours or so and retain about a day of those...

          As a side note (and no offense to the Hadoop community), I regard all shared storage as defective for data stewardship purposes if it does not have the above features (except writable snapshots; that's candy), and I am not the least bit alone. Any data protection strategy that offers "go to tape for that" as its first answer is... onerous.

          While the following matter is merely my opinion, I feel pretty sure that the rise of the enterprise NAS appliance (e.g., NetApp et al) is at least partly due to the default nature of snapshot protection on those devices. Food for thought.

          Lars Hofhansl added a comment -

          For what it's worth... here's another vote for read-only snapshots.
          We're planning a big deployment of HBase, and this would be a very helpful feature for consistent filesystem-level backups.

          Allen Wittenauer added a comment -

          Closing this as a duplicate.

          Historical fun fact:

          Sanjay would ask me what features I'd like to see in HDFS, quickly followed by "Don't say snapshots!" After HDFS-2802 was committed, the next time I saw him he said, "You should be happy! You finally got your #1 HDFS request!"


            People

             • Assignee: dhruba borthakur
             • Reporter: dhruba borthakur
             • Votes: 6
             • Watchers: 59
