(jon: I made a minor formatting tweak to make this easier to read the dir structure)
But before a detailed description of how timestamp-based snapshots work internally, lets answer some comments!
@Jon: I'll add more info to the document to cover this stuff, but for the moment, lets just get it out there.
What is the read mechanism for snapshots like? Does the snapshot act like a read-only table or is there some special external mechanism needed to read the data from a snapshot? You mention having to rebuild in-memory state by replaying wals – is this a recovery situation or needed in normal reads?
Its almost, but not quite like a table. Read of a snapshot is going to require an external tool but after hooking up the snapshot via the external tool, it should act just like a real table.
Snapshots are intended to happen as fast as possible, to minimize downtime for the table. To enable that, we are just creating reference files in the snapshot directory. My vision is that once you take a snapshot, at some point (maybe weekly), you export the snapshot to a backup area. In the export you actually do the copy of the referenced files - you do a direct scan of the HFile (avoiding the top-level interface and going right to HDFS) and the WAL files. Then when you want to read the snapshot, you can just bulk-import the HFIles and replay the WAL files (with the WALPlayer this is relatively easy) to rebuild the state of the table at the time of the snapshot. Its not an exact copy (META isn't preserved), but all the actual data is there.
The caveat here is since everything is references, one of the WAL files you reference may not actually have been closed (and therefore not readable). In the common case this won't happen, but if you snap and immediately export, its possible. In that case, you need to roll the WAL for the RS that haven't rolled them yet. However, this is in the export process, so a little latency there is tolerable, whereas avoiding this means adding latency to taking a snapshot - bad news bears.
Keep in mind that the log files and hfiles will get regularly cleaned up. The former will be moved to the .oldlogs directory and periodically cleaned up and the latter get moved to the .archive directory (again with a parallel file hierarchy, as per
HBASE-5547). If the snapshot goes to read the reference file, which tracks down to the original file and it doesn't find it, then it will need to lookup the same file in its respective archive directory. If its not there, then you are really hosed (except for the case mentioned in the doc about the WALs getting cleaned up by an aggressive log cleaner, which it is shown, is not a problem).
Haven't gotten around to implementing this yet, but it seems reasonable to finish up (and I think Matteo was interested in working on that part).
What is a representation of a snapshot look like in terms of META and file system contents?
The way I see the implementation in the end is just a bunch of files in the /hbase/.snapshot directory. Like I mentioned above, the layout is very similar to the layout of a table.
Lets look at an example of a table named "stuff" (snapshot names need to be valid directory names - same as a table or CF) and has column "column" which is hosted on servers rs-1 and rs-2. Originally, the file system will look something like (with license taken on file names - its not exact, I know, this is just an example) :
The snapshot named "tuesday-at-nine", when completed, then just adds the following to the directory structure (or close enough):
The only file here that isn't a reference here is the tableinfo since it is a pretty small file (generally), so a copy seemed more prudent over doing archiving on changes to the table info.
The original implementation updated META with file references to do hbase-level hard links for the HFiles. AFter getting the original implementation working, I'm going to be ripping this piece out in favor of just doing an HFile cleaner and cleaner delegates (similar to logs) and then have a snapshot cleaner that reads of the FS for file references.
At some point we may get called upon to repair these, I want to make sure there are enough breadcrumbs for this to be possible.
How could that happen - hbase never has problems! (sarcasm)
- hlog roll (which I believe does not trigger a flush) instead of special meta hlog marker (this might avoid write unavailability, seems simpler that the mechanism I suggested)
The hlog marker is what I'm planning on doing for the timestamped based snapshot, which is going to be far safer than doing an HLog roll and provide less latency. With the roll, you need to not take any writes to the memstore between the roll and the end of the snapshot (otherwise you will lose edits). Doing meta edits into the HLog allows you to keep edits and not worry about it.
admin initiated snapshot and admin initiated restore operations as opposed to acting like a read only table. (not sure what happens to "newer" data after a restore, need to reread to see if it is in there, not sure about the cost to restore a snapshot)
Yup, right now its all handled from HBaseAdmin. Matteo was interested in working on the restore stuff, but depending on timing, I may end up picking up that work when I get the taking of a snapshot working. I think part of "snapshots" definitely includes getting back the state.
I believe it also has an ability to read files directly from an MR job without having to go through HBase's get/put interface. Is that in scope for
Absolutely in scope. It just didn't come up because I considered that part of the restore (which Matteo expressed interest). If you had to go through the high-level interface, then you would just use the procedure Lars talks about in his blog: http://hadoop-hbase.blogspot.com/2012/04/timestamp-consistent-backups-in-hbase.html
The other notable change is that I'm building to support multiple snapshots concurrently. Its really a trivial change, so I don't think its too much feature creep, just a matter of using lists rather than a single item.
How does this buy your more consistency? Aren't we still inconsistent at the prepare point now instead? Can we just write the special snapshotting hlog entry at initiation of prepare, allowing writes to continue, then adding data elsewhere (META) to mark success in commit? We could then have some compaction/flush time logic cleanup failed atttempt markers?
See the above comment about timestamp based vs. point in time and the former being all that's necessary for HBase. This means we don't take downtime and end up with a 'fuzzy' snapshot in terms of global consistency, but is exact in terms of HBase delivered timestamps.
The problem point-in-time snapshots overcomes is reaching distributed consensus while still trying to maintain availability and the ability to cross partitions. Since no one has figured out CAP and we are looking for consistency, we have to remove some availability to reach consensus. In this case, the agreement is over the state of the entire table, rather than per region server.
Yes, this is strictly against the contract that we have on a Scan, but it is also in line with expectations people have on what a snapshot means. Any writes that are pending before the snapshot are allowed to commit, but any writes that reach the RS after the snapshot time cannot be included in the snapshot. I got a little overzealous in my reading of
HBASE-50 and took it to mean global state, but after review the only way it would work within the constraints (no downtime) is to make it timestamp based.
But why can't we get global consistency without taking downtime?
Let's take your example of using an HLog edit to mark the start (and for ease, lets say the end as well - as long as its durable and recoverable, it doesn't matter if its WAL or META).
Say we start a snapshot and send a message to all the RS (lets ignore ZK for the moment, to simplify things) that they should take a snapshot. So they write a marker into the HLog marking the start, create references as mentioned above, and then report to the master that they are done. When everyone is done, we then message each RS to commit the snapshot, which is just another entry into the WAL. Then in rebuilding the snapshot, they would just replay the WAL up to the start (assuming the end is found).
How do we know though which writes arrived first on each RS if we just dump a write into the WAL? Ok, so then we need to wait for the MVCC read number to roll forward to when we got the snapshot notification before we can write an edit to the log - totally reasonable.
However, the problem arises in attempting to get a global state of the table in a high-volume write environment. We have no guarantee that the "snapshot commit" notification reached each of the RS at the same time. And even if it did reach them at the same time, maybe there was some latency in getting the write number. Or the switch was a little wonky, or it just finishing up a GC (I could go on).
Then we have a case where we don't actually have the snapshot as of the commit, but rather "at commit, plus or minus a bit" - not a clean snapshot (if we don't care about being exact then we can do a much faster, lower potential latency solution, the discussion of which is still coming, I promise). In a system that can take millions of writes a second, that is still a non-trivial amount of data that can change in a few milliseconds, no longer a true 'point in time'.
The only way to get that global, consistent view is to remove the availability of the table for a short time so we know that the state is the same across all tables.
Say we start a snapshot and the start indication doesn't reach the servers and get started at exactly the same time on all the servers, which, as explained above, is very likely. Then we let the servers commit any outstanding writes,but they don't get to take any new writes or a short time. In this time while they are waiting for writes to commit, we can then do all the snapshot preparation (referencing, table info copying). Once we are ready for the snapshot, we report back to the master and wait for the commit step. In this time we are still not taking writes. The key here is that for that short time, none of the servers are taking writes and that allows us to get a single point in time that no writes are committing (but they do get buffered on the server, they just can't change the system state).
If we let writes commit, then how do we reach a state that we can agree on across all the servers? If you let the writes commit, you again don't have any assurances that the prepare or the commit message time is agreed to by all the servers. The table-level consistent state is somewhere between the prepare and commit, but it's not clear how one would find that point - I'm pretty sure we can't do this unless we have perfectly synchronized clocks, which is not really possible without a better understanding of quantum mechanics
Block writes is a perhaps a bad phrase in this situation. In the current implementation, it buffers the writes as threads into the server, blocking on the updateLock. However, we can go with a "semi-blocking" version: writes still complete, but they aren't going to be visible until we roll forward to the snapshot MVCC number. This lets the writers complete (not affecting latency), but is going to affect read-modify-write and reader-to-writer comparison latency. However, as soon as we roll forward the MVCC, all those writes become visible, essentially catching back up to the current state. A slight modification to the WAL edits will need to be made to write the MVCC number so we can keep track of which writes are in/out of a snapshot, but that shouldn't be too hard (famous last words). You don't even need to modify all the WAL edits, just those made during the snapshot window, so the over the wire cost is still kept essentially the same, when amortized over the life of a table (for the standard use case).
I'm looking at doing this once I get the simple version working - one step at a time. Moving to the timestamp based approach lets us keep taking writes but does so at the cost of global consistency in favor of local consistency and still uses the exact same infrastructure. The first patch I'll actually put on RB will be the timestamp based, but let me get the stop the world version going before going down a rabbit hole.
The only thing we don't capture is if a writer makes a request to the RS before the snapshot is taken (by another client), but the write doesn't reach the server until after the RS hits the start barrier. From the global client perspective, this write should be in the snapshot, but that requires a single client or client-side write coordination (via a timestamp oracle). However, this is even worse coordination and creates even more constraints on the system where we currently have no coordination between clients (and I'm against adding any). So yes, we miss that edit, but that would be the case in a single-server database anyways without an external timestamp manager (to again distributed coordination between the client and server, though it can be done in a non-blocking manner). I'll mention some of this external coordination in the timestamp explanation.