  HBase / HBASE-2294

Enumerate ACID properties of HBase in a well defined spec

Details

    • Type: Task
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.0
    • Component/s: documentation
    • Labels: None
    • Hadoop Flags: Incompatible change, Reviewed

    Description

      It's not written down anywhere what the guarantees are for each operation in HBase with regard to the various ACID properties. I think the developers know the answers to these questions, but we need a clear spec for people building systems on top of HBase. Here are a few sample questions we should endeavor to answer:

      • For a multicell put within a CF, is the update made durable atomically?
      • For a put across CFs, is the update made durable atomically?
      • Can a read see a row that hasn't been sync()ed to the HLog?
      • What isolation do scanners have? Somewhere between snapshot isolation and no isolation?
      • After a client receives a "success" for a write operation, is that operation guaranteed to be visible to all other clients?
      • etc.

      I see this JIRA as having several points of discussion:

      • Evaluation of what the current state of affairs is
      • Evaluate whether we currently provide any guarantees that aren't useful to users of the system (perhaps we can drop some in exchange for performance)
      • Evaluate whether we are missing any guarantees that would be useful to users of the system

      Attachments

        1. formatting_mods.patch
          5 kB
          Michael Stack
        2. formatting_mods-v2.patch
          5 kB
          Michael Stack
        3. hbase-2294.patch
          11 kB
          Todd Lipcon

        Issue Links

          Activity

            stack Michael Stack added a comment -

            Making this a blocker on 0.21 and 0.20.4. For 0.20.4, I think it's OK if it's incomplete before we release. Not so for 0.21.

            ryanobjc ryan rawson added a comment -

            So previously, without durability, some of the things in here were just not applicable. Sync = noop really doesn't lend itself to answering these questions.

            I had postponed commenting until I fixed HBASE-2248, but now that I have, here are my suggestions on how we should do things:

            • Row mutate operations should be atomic. Concurrent gets/scans do not see the results of a row mutation until it is "finished". In the code, this means "when rwcc.completeMemstoreInsert() is called". This has to happen after all KVs have been put in the memstore. We have to call HLog.sync() before we start modifying the memstore, so that if there is any HLog issue we don't mutate the memstore. Thus rows become visible very shortly after an HLog.sync() occurs - the time it takes to modify in-memory structures and call rwcc.completeMemstoreInsert(). (A toy sketch of this ordering follows this list.)
            • Row mutates across multiple families should be atomic. This was not too hard to implement in HBASE-2248 and represents a good level of service, I think.
            • Reads cannot see rows that have not been sync()ed to the HLog.
            • Scanners have weak isolation - they continuously see an updated view of the table as they run across rows. That means a scanner can see rows inserted after its creation. Providing stronger isolation doesn't make sense since there are no inter-row atomicity guarantees.
            • Once a client gets a success after a mutation operation, all other clients, including itself, will be able to see the new data.
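
            Below is a minimal, self-contained toy model of this ordering - not HBase source code; the class and field names are invented, and a single writer is assumed (the real RWCC handles concurrent writers). It syncs the log before touching the memstore and flips visibility last:

              import java.util.ArrayList;
              import java.util.List;
              import java.util.concurrent.atomic.AtomicLong;

              class ToyRegion {
                  static class Edit {
                      final String cell;
                      final long writeNum;
                      Edit(String cell, long writeNum) { this.cell = cell; this.writeNum = writeNum; }
                  }

                  private final List<String> log = new ArrayList<String>();    // stands in for the HLog
                  private final List<Edit> memstore = new ArrayList<Edit>();   // stands in for the MemStore
                  private final AtomicLong nextWriteNum = new AtomicLong(1);
                  private volatile long readPoint = 0;                         // stands in for the RWCC read point

                  void put(List<String> cells) {
                      long writeNum = nextWriteNum.getAndIncrement();
                      log.addAll(cells);
                      syncLog();                           // durability first; if this throws, the memstore is untouched
                      for (String c : cells) {
                          memstore.add(new Edit(c, writeNum));
                      }
                      readPoint = writeNum;                // "completeMemstoreInsert": the whole mutation becomes visible here
                  }

                  List<String> read() {                    // readers only ever see fully completed writes
                      long rp = readPoint;
                      List<String> out = new ArrayList<String>();
                      for (Edit e : memstore) {
                          if (e.writeNum <= rp) {
                              out.add(e.cell);
                          }
                      }
                      return out;
                  }

                  private void syncLog() { /* a real HLog.sync() would flush the edits to HDFS here */ }
              }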

            In my work for HDFS-0.21, it was pretty obvious that hflush was fairly slow. For high-volume updates with lower-value data (e.g. calling ICV on a row many thousands of times a second), it seemed to make sense to use a time-based flush. That is, the durability promise is relaxed slightly to say that the row is only durable after X milliseconds (configurable) at the most. This is a per-table setting (see: HBASE-1944).

            Right now we have the in-memory atomic reads. The durability story is being improved in HBASE-2283 with restructuring hlog appends/syncs and memstore mutations. The performance and locking of in-memory atomic reads is being improved in HBASE-2248.

            stack Michael Stack added a comment -

            Above looks good to me Ryan. Only thing I'd change would be the language. It should be rewritten in spec-speak. For example, rather than "should be atomic", in a spec we'd say "are atomic" (and it's a bug if it ain't so).

            Regarding scanners, we should be clearer that they'll never see partial updates to a row (even if it is made from many families), or in other words, that they respect row locks. Also, if no timestamp is specified, scanners will return the state of the row as of the time the scanner encounters the row. Otherwise, if the scanner is opened with an explicit timestamp, they'll return the state of the row as of the specified timestamp.

            I think we probably also need to describe how a Scanner moves through a table, explaining what rows it will return.

            We should probably describe how deletes work too; though I think it should be plain given the above, we should just spell it out in the spec.

            tlipcon Todd Lipcon added a comment -

            Thanks for the input, guys. Also, I think we should refrain from talking about implementation details (the RWCC stuff, internal timestamps, HLog, etc). So instead of talking about hlog sync, we should say "made durable".

            ryanobjc ryan rawson added a comment -

            I agree on the implementation details. I was just illustrating how the code currently works. It helps to have a concrete example of how things are done when writing specs. On the plus side, I think what I described above both fits a good user experience and is possible to implement (as evidenced by having a working implementation thereof over in HBASE-2248).

            I would also like to keep the term 'row lock' out - I think we could possibly have serialized atomic updates to HBase without row locks (wow!).

            One point of discussion: I think it's important to have a scanner stay 'up to date' as much as possible. Not only would it simplify the implementation (as is); it also makes little sense to provide a level of transaction isolation without a broader transaction promise. If you are doing an aggregate scan on a table via MapReduce, we already provide a mechanism for giving yourself a consistent view of the world, and that is the Scan#setTimeRange() call. Supporting snapshot isolation in a Scan would require carrying the consistency-view information from region to region, and without some serious changes we could not support that. Given our existing support, I would argue it is unnecessary to do further work to promise large-scale scanner consistency.

            Scanner consistency is already an issue in the master META scanner. We have to double check the results of a scan to avoid problematic things such as double assignment. Keeping the scanner more lively will help with this.

            One area where users could have issues would be consuming/producing rows in the same job. The MapReduce framework helps with this: with TIF you read in one phase, and with TOF you write in another, and the two are by necessity non-overlapping.

            The more I think about it, the more I realize that if a user wants perfect isolation, they should use Scan#setTimeRange() - it supports everything you want: restartable scanners, simple semantics, and cross-region support, and it has an existing implementation.
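
            A hedged sketch of this Scan#setTimeRange() pattern against the 0.90-era client API (the table name "mytable" and the empty processing loop are made up for illustration): fix the scan's maximum timestamp once and reuse it, and the scan returns only cells written before that point, giving an approximately frozen, restartable view across regions:

              import org.apache.hadoop.hbase.HBaseConfiguration;
              import org.apache.hadoop.hbase.client.HTable;
              import org.apache.hadoop.hbase.client.Result;
              import org.apache.hadoop.hbase.client.ResultScanner;
              import org.apache.hadoop.hbase.client.Scan;

              public class SnapshotLikeScan {
                  public static void main(String[] args) throws Exception {
                      HTable table = new HTable(HBaseConfiguration.create(), "mytable");
                      long cutoff = System.currentTimeMillis();   // chosen once; reuse it if the scan must restart

                      Scan scan = new Scan();
                      scan.setTimeRange(0, cutoff);               // only cells with timestamp < cutoff are returned
                      ResultScanner scanner = table.getScanner(scan);
                      try {
                          for (Result row : scanner) {
                              // process one row of the (approximately) frozen view
                          }
                      } finally {
                          scanner.close();
                      }
                      table.close();
                  }
              }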

            tlipcon Todd Lipcon added a comment -

            I would also like to keep the term 'row lock' out - I think we could possibly have serialized atomic updates to HBase without row locks (wow!).

            +1. I'd also love to consider dropping the user-exposed row lock feature entirely. This might be unpopular, but I think that feature is dangerous, and compareAndSwap is an equally powerful concurrency primitive that's a lot less complex. What do you guys think?

            One point of discussion, I think it's important to have a scanner stay 'up to date' as much as possible

            Also +1, I think the current compromise makes sense - don't see partial row mutations, but at the beginning of each row, the freshest data is taken.

            The Map-reduce framework helps with this, with TIF you can read in one pass, and TOF you write in another phase that are by necessity non-overlapping.

            Not true of a map-only job, right?

            The more I think about it, the more I realize a user wants perfect isolation, they should use Scan#setTimerange() - it supports everything you want: restartable scanners, simple semantics, and cross-region support and has an existing implementation

            Again +1

            tlipcon Todd Lipcon added a comment -

            I'd also love to consider dropping the user-exposed row lock feature entirely. This might be unpopular, but I think that feature is dangerous, and compareAndSwap is an equally powerful concurrency primitive that's a lot less complex.

            Stated differently: the current row lock feature can be implemented on the client side through use of an atomic compareAndSet primitive. So, if we provide CAS within the region server, row lock primitives are redundant and add complexity for no apparent gain.
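
            A hedged sketch of this client-side pattern using the existing checkAndPut call (the table, row, and column names are made up, and the cell is assumed to already hold an 8-byte long): read the cell, compute the new value locally, and write it back only if the cell still holds what was read, retrying otherwise:

              import org.apache.hadoop.hbase.HBaseConfiguration;
              import org.apache.hadoop.hbase.client.Get;
              import org.apache.hadoop.hbase.client.HTable;
              import org.apache.hadoop.hbase.client.Put;
              import org.apache.hadoop.hbase.util.Bytes;

              public class CasUpdate {
                  public static void main(String[] args) throws Exception {
                      HTable table = new HTable(HBaseConfiguration.create(), "accounts");
                      byte[] row = Bytes.toBytes("user1");
                      byte[] cf  = Bytes.toBytes("info");
                      byte[] col = Bytes.toBytes("balance");

                      boolean applied = false;
                      while (!applied) {
                          // read the current value (assumed to exist and be an 8-byte long)
                          byte[] current = table.get(new Get(row).addColumn(cf, col)).getValue(cf, col);
                          long next = Bytes.toLong(current) - 100;   // the "modify" step, done client-side

                          Put put = new Put(row);
                          put.add(cf, col, Bytes.toBytes(next));
                          // write only if the cell still holds what we read; otherwise loop and retry
                          applied = table.checkAndPut(row, cf, col, current, put);
                      }
                      table.close();
                  }
              }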

            henryr Henry Robinson added a comment -

            Stated differently: the current row lock feature can be implemented on the client side through use of an atomic compareAndSet primitive. So, if we provide CAS within the region server, row lock primitives are redundant and add complexity for no apparent gain.

            The only downside of CAS compared to a lock is that the client has to handle the retry on failure. Putting CAS in a tight loop on a contended row might be painful as well. Still preferable to locks, in my opinion!

            tlipcon Todd Lipcon added a comment -

            The only downside of CAS compared to a lock is that the client has to handle the retry on failure

            Yea, we could add a signaling primitive (e.g. "wait for change on cell") which also allows spurious wakeups. With spurious wakeups allowed we can afford more simplicity on the server side but still avoid busy spinning.

            ryanobjc ryan rawson added a comment -

            One thing to note - rowLock/rowUnlock is not a scalable interface anyways! If you use explicit row locks, and you have a lot of people waiting to acquire the row, you start consuming handler threads and can DOS yourself.

            Perhaps we could provide a client-only option that uses CAS to allow an atomic series of operations to occur, with a retry strategy.


            jdcryans Jean-Daniel Cryans added a comment -

            <crazyidea>If we remove row locks from core HBase, maybe we should still support them at the HTable level but by using ZK directly instead of contacting region servers.</crazyidea>

            tlipcon Todd Lipcon added a comment -

            maybe we should still support them at the HTable level but by using ZK directly instead of contacting region servers

            I don't think that works, because you'd still be able to use the non-locking calls on that row by contacting the RS. We certainly can't gate all row operations through a ZK lock.

            tlipcon Todd Lipcon added a comment -

            Here's a first pass at some kind of spec. These aren't meant to be final - just posting for discussion. I anticipate that after we (developers) come to some kind of conclusion here we will want to run this by the user list to see if we're missing use cases, etc.

            Definitions

            For the sake of common vocabulary, we define the following terms:

            ATOMICITY: an operation is atomic if it either completes entirely or not at all
            CONSISTENCY: all actions cause the table to transition from one valid state directly to another (e.g. a row will not disappear during an update, etc.)
            ISOLATION: an operation is isolated if it appears to complete independently of any other concurrent transaction
            DURABILITY: any update that reports "successful" to the client will not be lost
            VISIBILITY: an update is considered visible if any subsequent read will see the update as having been committed

            APIs to consider

            • Read APIs
              • get
              • scan
            • Write APIs
              • put
              • delete
            • Combination (read-modify-write) APIs
              • incrementColumnValue
              • compareAndSet

            Guarantees Provided

            Atomicity

            1. All mutations are atomic within a row. Any put will either wholly succeed or wholly fail.
              1. An operation that returns a "success" code has completely succeeded.
              2. An operation that returns a "failure" code has completely failed.
              3. An operation that times out may have succeeded and may have failed. However, it will not have partially succeeded or failed.
            2. This is true even if the mutation crosses multiple column families within a row.
            3. APIs that mutate several rows will not be atomic across the multiple rows. For example, a multiput that operates on rows 'a','b', and 'c' may return having mutated some but not all of the rows. XXX: will they return failure or success or some mixed response here?
            4. The compareAndSet API happens atomically as is typically understood by this operation.

            Consistency and Isolation

            1. All rows returned via any access API will consist of a complete row that existed at some point in the table's history.
            2. This is true across column families - i.e. a get of a full row that occurs concurrently with some mutations 1, 2, 3, 4, 5 will return a complete row that existed at some point in time between mutation i and i+1 for some i between 1 and 5.

            Consistency of Scans

            A scan is not a consistent view of a table. Scans do not exhibit snapshot isolation.

            Rather, scans have the following properties:

            1. Any row returned by the scan will be a consistent view (i.e. that version of the complete row existed at some point in time)
            2. A scan will always reflect a version at least as new as the beginning of the scan. This satisfies the visibility guarantees enumerated below.
              1. For example, if client A writes data X and then communicates via a side channel to client B, any scans started by client B will contain data at least as new as X.
              2. Scans may include data that is newer than the start of the scan.
              3. Another way of stating this is that a scan must reflect all mutations committed prior to the construction of the scanner, and may reflect some mutations committed subsequent to the construction of the scanner.

            Those familiar with relational databases will recognize this isolation level as "read committed".

            XXX: Ryan has mentioned the model of "scans will always get the most up-to-date version of a row when beginning a new row". Do we want to guarantee this or just leave it at "some version of the row at least as new as what existed at scan start"?

            Visibility

            1. When a client receives a "success" response for any mutation, that mutation is immediately visible to both that client and any client with whom it later communicates through side channels.
            2. A row will never exhibit so-called "time-travel" properties. That is to say, if a series of mutations moves a row sequentially through a series of states, any sequence of concurrent reads will return a subsequence of those states.
              1. For example, if a row's cells are mutated using the "incrementColumnValue" API, a client will never see the value of any cell decrease.
              2. This is true regardless of which read API is used to read back the mutation.
            3. Any version of a cell that has been returned to a read operation is guaranteed to be durably stored.

            Durability

            1. All visible data is also durable data. That is to say, a read will never return data that is not durably on disk.
            2. Any operation that returns a "success" code (e.g. does not throw an exception) will be made durable.
            3. Any operation that returns a "failure" code will not be made durable (subject to the Atomicity guarantees above).
            4. All reasonable failure scenarios will not affect any of the guarantees of this document.

            XXX: should expand this to include the concept of tunable durability windows (this also impacts visibility since you can experience time travel during failure if some updates aren't durable)
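
            A hedged illustration of the visibility guarantees above against the 0.90-era client API (the table and column names are made up): a successful incrementColumnValue is immediately visible to the caller, and a read-back never sees the counter move backwards:

              import org.apache.hadoop.hbase.HBaseConfiguration;
              import org.apache.hadoop.hbase.client.Get;
              import org.apache.hadoop.hbase.client.HTable;
              import org.apache.hadoop.hbase.util.Bytes;

              public class VisibilityExample {
                  public static void main(String[] args) throws Exception {
                      HTable table = new HTable(HBaseConfiguration.create(), "counters");
                      byte[] row = Bytes.toBytes("page1");
                      byte[] cf  = Bytes.toBytes("stats");
                      byte[] col = Bytes.toBytes("hits");

                      // a successful increment is immediately visible to this client...
                      long afterIncrement = table.incrementColumnValue(row, cf, col, 1);

                      // ...so an immediate read-back returns a value at least that new
                      // (it may be larger if other clients are also incrementing, but per
                      // the "no time travel" rule it never goes backwards).
                      byte[] readBack = table.get(new Get(row).addColumn(cf, col)).getValue(cf, col);
                      assert Bytes.toLong(readBack) >= afterIncrement;

                      table.close();
                  }
              }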

            kimballa Aaron Kimball added a comment -

            APIs that mutate several rows will not be atomic across the multiple rows. For example, a multiput that operates on rows 'a','b', and 'c' may return having mutated some but not all of the rows. XXX: will they return failure or success or some mixed response here?

            Worth pointing out for comparison that JDBC's Statement.executeBatch() will return a success code iff all operations completed successfully. If some of the underlying updates in a batch failed, it will return an array of success/failure codes indicating results on a per-operation level. Likewise, you may wish to consider returning an array of success/failed/timeout responses for each row updated in a multi-row put.
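
            For concreteness, a hedged sketch of the java.sql behavior referenced above (the in-memory H2 URL and the table are illustrative assumptions): executeBatch() returns one update count per batched statement, and a partial failure surfaces as a BatchUpdateException that still carries per-statement results:

              import java.sql.BatchUpdateException;
              import java.sql.Connection;
              import java.sql.DriverManager;
              import java.sql.Statement;

              public class BatchExample {
                  public static void main(String[] args) throws Exception {
                      Connection conn = DriverManager.getConnection("jdbc:h2:mem:test");  // assumes the H2 driver is on the classpath
                      Statement stmt = conn.createStatement();
                      stmt.execute("CREATE TABLE t (x INT)");
                      stmt.addBatch("INSERT INTO t VALUES (1)");
                      stmt.addBatch("INSERT INTO t VALUES (2)");
                      try {
                          int[] counts = stmt.executeBatch();    // one update count per batched statement
                          System.out.println(counts.length + " statements succeeded");
                      } catch (BatchUpdateException e) {
                          int[] partial = e.getUpdateCounts();   // per-statement results for a partial failure
                          System.out.println("batch failed; " + partial.length + " per-statement results available");
                      }
                      conn.close();
                  }
              }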

            tlipcon Todd Lipcon added a comment -

            Henry points out that I missed part of the definition of atomicity in the above. Namely, multiple concurrent writes will be linearized such that each takes effect in an instant of time. Thus, if I have two concurrent writes of "a=1,b=1,c=1" and "a=2,b=2,c=2" I will end up with one of the two states, but not a state that mixes 1s and 2s.

            tlipcon Todd Lipcon added a comment -

            On further thought, the "no time travel" guarantee is actually a bit tricky to implement in the face of failure. For example, a RS can enter a GC pause, have its regions reassigned, then come back to life. Before it finds out from ZK that it has been determined to be dead, it could serve some reads of stale data to a client that has cached the old region location. So we should think carefully about this requirement.

            stack Michael Stack added a comment -

            Thanks for putting together first cut Todd. Should we put it somewhere we can all hack on it? Up on hbase wiki or over in public google doc?

            + Minor: In the current API it's called checkAndPut rather than compareAndSave: http://su.pr/2tPOlr
            + "XXX: will they return failure or success or some mixed response here?" Failure with a list of what updates succeeded+failed (As per Aaron's JDBC illustration).
            + "...at least as new as the beginning of the scan" could confuse, or rather, to avoid confusion we need to explain all the ways in which the scanner works. Whats being described is the case where no timestamp is specified on opening of a scan. In this case we want to return the latest version of the row as of the time the scanner happens upon it (the row). Even so, there is no reason why the table row data may not be older than the scan opening. In this case we'd return data older than scanner opening (Are you saying this Todd when you say "at least as new.."?). Also, you can specify the timestamp range when you open a scanner and in this case, it'll return cells that lie within the bounds of the timestamp range irrepective of the time at which the scanner was opened.
            + On 'XXX: Ryan has mentioned the model of "scans will always get the most up-to-date version of a row when beginning a new row". Do we want to guarantee this or just leave it at "some version of the row at least as new as what existed at scan start"?', I'm fine with either. The latter seems an easier guarantee.
            + Is this a slip on your part Todd -> "...not durably on disk." Maybe we should adjust your durable definition to suit what we can offer, or, maybe you intend that we adjust the filesystem to match your durability definition?
            + +1 on adding the Henry amendment
            + On the 'time travel' requirement in the face of failure, how about we take it on (with your note that it's hard to do), so that if 'time travel' occurs, it's a bug.

            tlipcon Todd Lipcon added a comment -

            Should we put it somewhere we can all hack on it? Up on hbase wiki or over in public google doc?

            I made a gist: http://gist.github.com/333664 - hopefully this is easy to hack on and also retain edit history

            Regarding the scan stuff, we should add some section to clarify "latest" with regard to timestamp versus "latest" with regard to the time of the edit. There's a lot of confusion over the semantics here - do timestamps just act like a "z dimension" or do they actually impact consistent views, etc?

            Will try to address your other feedback on the gist later this evening.

            ykulbak Yoram Kulbak added a comment -

            [Ryan] One point of discussion, I think it's important to have a scanner stay 'up to date' as much as possible

            [Todd] Also +1, I think the current compromise makes sense - don't see partial row mutations, but at the beginning of each row, the freshest data is taken.

            [Stack] + On 'XXX: Ryan has mentioned the model of "scans will always get the most up-to-date version of a row when beginning a new row". Do we want to guarantee this or just leave it at "some version of the row at least as new as what existed at scan start"?', I'm fine with either. The latter seems an easier guarantee.

            I'm basing my comment on HBASE-2248 playing a major factor in enforcing the ACID properties of HBase.

            IMHO having the scanner stay 'up to date' as much as possible is a nice-to-have, definitely not important enough to hurt performance. A quick look at the suggested patch for HBASE-2248 reveals that in order to enforce the rule above the memstore scanner was reverted to using the 0.20.2-style ConcurrentSkipListSet#tailSet operation. Our experiments on 0.20.2 showed that with this style of memstore scanning it's actually 3 times slower to scan the memstore than it is to scan the store files (with block cache enabled).
            Also (assuming that the HBASE-2248 patch is committed), I don't see any point in a 'best effort' guarantee: since, from the user's perspective, "'up to date' as much as possible" is not clearly defined, it's better to guarantee the clear-cut notion of seeing your own writes, which leaves leeway for future performance tweaks.

            I haven't performance tested any of the suggested patches for HBASE-2248, but it seems like PE is going to be performed soon. My guess is that if the PE numbers are compared to the existing baseline, they may show a slow-down.

            I'm not familiar with the wide range of use-cases for HBase, but my experience is that usually scanning through a single region takes less than a second. Every time the client scanner moves to a new region a new region scanner is instantiated (which grabs the latest 'region state'), and so in most cases the client scanner will encounter rows which are at most a couple of seconds old. Slower scans will usually be due to the client side performing some lengthy operations during the scan. I would think that clients which do 'lengthy scans' don't particularly care about performance and hence, if they wish to make the best effort to process up-to-date rows, they can issue a GET for every row before they process it. For most cases, I would expect a row which is at most a couple of seconds old to be good enough.

            tlipcon Todd Lipcon added a comment -

            IMHO having the scanner stay 'up to date' as much as possible is a nice-to-have, definitely not important enough to hurt performance.

            I think I agree with you. I don't want to sidetrack this particular JIRA towards implementation details, so I'll leave it at that. Without regard to the specifics of the other JIRA, it seems likely to me that the "as up to date as possible" can often be implemented more efficiently than the "snapshot iterator". The current implementation may not be up to snuff, so I'll leave it at this: I think the scanner semantics should be as loose as possible to achieve the maximum speed, and I view "up to date" as looser than snapshot.

            I would think that clients which do 'lengthy scans' don't particularly care about performance

            I disagree - MR jobs are a typical "lengthy scan" application and throughput is certainly important. Especially important is the ability to have the bulk (MR) jobs coexist with high concurrent live load on the table.

            ryanobjc ryan rawson added a comment -

            I'm not sure it's fair to say that a bulk job must retrieve the data twice if they want freshness. Since everything is already in memory, it seems easy to be more live/fresh.

            tlipcon Todd Lipcon added a comment -

            Ryan and I just discussed the snapshot vs up-to-date question on IRC. Brief summary:

            • We cannot provide snapshot isolation as a guarantee, since doing so across regions is really impossible. Users can make use of timestamp range filters to get snapshot-like behavior.
            • People don't seem to have concrete use cases for exactly-up-to-date semantics. Thus we shouldn't provide that either.
            • The "spec" above describes the semantics as "at least as new as the start of the scan and possibly newer". This permits anything from snapshot isolation all the way up to "up to date" semantics, but leaves us wide open for implementation paths. If we decide to go with up-to-date now, we'll have explicit documentation to say this is not a feature, and anyone relying on it may be broken by implementation changes down the road.

            Does that sound reasonable to you, Yoram?

            ykulbak Yoram Kulbak added a comment -

            That sounds great Todd. Cheers.

            tlipcon Todd Lipcon added a comment -

            I pushed an update to the doc in response to Stack's comments:
            https://gist.github.com/336081/6c64c14c35fa778d74f3c7fdcfde09a38dc4b5c9

            The time-travel thing is still somewhat worrisome. I think we have a few options here:

            a) We allow time travel reads always (a little weak, hard to program against). To satisfy other guarantees, we know that writes and read-modify-writes won't have this property.
            b) We disallow time travel from a single client, but different clients may be at different points in the timeline. That is to say, in the example of some set of processes incrementing a cell, a single reader will never see a cell decrease. However, a reader may see a cell at value N, communicate to a second reader, and the second reader may then see the cell at a value less than N.
            c) We give the user a call something like "ensureReadsUptodate()". This ensures that the reader will not read any data more stale than the time when this call is made. This is exactly what ZooKeeper does about the stale read problem - usually you get stale reads but don't care, and if you care, you call ZK's sync() method.
            d) We never allow time travel reads. I think this is nearly impossible to do without killing performance (essentially the region server would have to verify that it is still in charge of a region before every read).

            Thoughts?

            tlipcon Todd Lipcon added a comment -

            d) We never allow time travel reads. I think this is nearly impossible to do without killing performance (essentially the region server would have to verify that it is still in charge of a region before every read).

            Actually, after much head scratching, I take it back. If we make an assumption that the clocks of all of the nodes progress within some error threshold of the same rate, we can do this efficiently. We just need to keep track of the timestamp at which we began our last ZK write, and be careful not to serve writes if it's possible that we might have been expired. I think the assumption of equal rate clocks is an OK one (note this is different than synchronized clocks). I'll sleep on this and write up a description tomorrow to make it more clear.

            henryr Henry Robinson added a comment -

            Clocks, anecdotally, do progress at different rates. Also you would have to ensure that you read and test the clock atomically with the update, otherwise you can read an OK timestamp, get a GC pause, and then send the update. The only thing a clock read can really tell you is that it was time X some time in the past.

            Wouldn't providing b and c be the best combination? But maybe parameterise the ensureUpToDate read by a logical timestamp. So if you want to maintain out-of-band causality, you have to send the timestamp to your peer for them to do the read. Makes it explicit.

            tlipcon Todd Lipcon added a comment -

            Clocks, anecdotally, do progress at different rates

            Certainly within a bounded amount of error for practical systems, though - we could set this error as high as 50%, and on the time scales we're talking about I don't think one node will think 5 seconds passed while another thinks 10.

            Also you would have to ensure that you read and test the clock atomically with the update,

            I don't think so, in this case. Let me try to work through this somewhat [maybe overly] formally (mostly to convince myself too!)

            We have the following events:

            1) node A reads its timestamp as T1
            2) node A sends a sync() message to ZK
            2a) ZK receives sync() method and responds
            3) node A receives success from sync (ie things have been sunk)
            3a) concurrently at some point, node A loses its connection to ZK (network partition or some such)
            4) client C sends a request to node A (call this T2)
            5) node A receives a request from client C (call this T3)
            6) node A responds to C
            7) C receives response
            8) ZK times out session to A (call this T4)
            [note that this sequence above isn't a defined ordering]

            Looking at "happens-before" relations, we know the following easily:

            • 1 < 2 < 3 < 5 < 6 (these are all seen by A in this order, so we know it to be true)
            • 3 < 3a (the connection must have been up when we received success)
            • 1 < 2a < 3 (causal)
            • 4 < 5 < 6 < 7 (causal)

            Let's say that ZK will time out a node it hasn't heard from in Z seconds. From ZK's perspective, then, step 8 occurs at least Z seconds after step 2a. Since step 1 happens before 2a (see above), we know that step 8 happens at least Z seconds after step 1. If we assume that ZK's clock progresses at some bounded error ratio relative to A's clock, then, measured on A's clock, step 8 happens at least Z*errorRatio seconds after ZK received the sync. ZK received the sync (2a) some unknown amount of time after T1 due to latency. So T4, from A's perspective, is at least T1 + Z*errorRatio + latency. That is, as long as we are within Z*errorRatio seconds of sending our last ZK message, we are "in the clear" that no one else has decided we're dead.

            Back to the problem at hand: to avoid "time travel" reads, what we need to do is make sure that when a client initiates the read, the target region server is still holding the region (i.e. 4 happens before 8). We already know 4 happens before 5, so "5 happens before 8" is a stronger condition. We know step 5 happens before step 8 if T3 < T4, and we decided T4 >= T1 + Z*errorRatio + latency. So if T3 < T1 + Z*errorRatio we are good to go; we don't know latency, but it's always positive, so dropping it only makes the check more conservative.

            Does this sound correct?
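
            A small sketch of how a region server could implement that check, tracking on a local monotonic clock the start time of the last ZK write that was acknowledged (illustrative names, not existing HBase code; errorRatio is the assumed worst-case ratio of this node's clock rate to ZK's, so it should be below 1 to be conservative):

                // Illustrative only.
                class RegionLeaseCheck {
                  private final long zkSessionTimeoutNanos;   // Z, in nanoseconds
                  private final double errorRatio;            // e.g. 0.5
                  private volatile long lastAckedWriteStart = Long.MIN_VALUE;  // T1 of the last acked ZK write

                  RegionLeaseCheck(long zkSessionTimeoutNanos, double errorRatio) {
                    this.zkSessionTimeoutNanos = zkSessionTimeoutNanos;
                    this.errorRatio = errorRatio;
                  }

                  // Call just before sending a ZK write; remember the returned T1.
                  long beforeZkWrite() {
                    return System.nanoTime();
                  }

                  // Call when ZK acknowledges that write, passing the T1 recorded above.
                  synchronized void afterZkWriteAcked(long t1) {
                    lastAckedWriteStart = Math.max(lastAckedWriteStart, t1);
                  }

                  // Safe while T3 - T1 < Z * errorRatio: ZK cannot have expired our session yet,
                  // so no other server can be holding the region. Latency only adds margin.
                  boolean safeToServeRead() {
                    long t3 = System.nanoTime();
                    return lastAckedWriteStart != Long.MIN_VALUE
                        && (t3 - lastAckedWriteStart) < (long) (zkSessionTimeoutNanos * errorRatio);
                  }
                }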

            tlipcon Todd Lipcon added a comment -

            Despite my lengthy comment above, I think option (c) for stale reads is actually best. We would by default allow stale reads, and add an API (either per-Get or perhaps per-HTable instance) to say either "ensure data is up to date with respect to the current instant" or "all further reads should be up-to-date". This will allow us to add a new feature in the future of read-only region replicas (HBASE-2357). I think we should, though, guarantee "no time travel of a single row from the vantage point of a single client". This needs a bit more fleshing out, will think on it and add to the document this week.

            ryanobjc ryan rawson added a comment -

            Right now the way we implement some of this stuff is via the JMM: we use locks and other constructs to essentially create an order of events that a client will see. For example, in my 2248 patch, the ReadWriteConcurrencyControl uses an atomically incremented long to move the state in one direction only. Users won't have time-travel problems in this regime.

            In terms of crashes and log recovery, the sequential log recovery id solves this problem, no?
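
            For readers who haven't looked at the 2248 patch, the pattern being described is roughly the following (a simplified illustration, not the actual ReadWriteConcurrencyControl code):

                import java.util.TreeSet;

                // Writers get increasing write numbers; readers only see edits at or below a read point
                // that moves forward monotonically, so a row never appears to "time travel".
                class SimpleMvcc {
                  private long nextWriteNumber = 1;
                  private long readPoint = 0;                  // readers ignore edits numbered above this
                  private final TreeSet<Long> pending = new TreeSet<Long>();

                  synchronized long beginWrite() {
                    long wn = nextWriteNumber++;
                    pending.add(wn);
                    return wn;
                  }

                  // Called only after all of the mutation's KeyValues are in the memstore (and after
                  // the HLog sync), so the whole row change becomes visible to readers at once.
                  synchronized void completeWrite(long writeNumber) {
                    pending.remove(writeNumber);
                    long limit = pending.isEmpty() ? nextWriteNumber - 1 : pending.first() - 1;
                    if (limit > readPoint) {
                      readPoint = limit;                       // only ever moves forward
                    }
                  }

                  synchronized long getReadPoint() {
                    return readPoint;
                  }
                }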

            tlipcon Todd Lipcon added a comment -

            Yea, the reason I'm voting for C is not for the current model, but rather so we can add some more features down the road. The idea of a read-only replica that lags the actual region is one (see HBASE-2357). Another is that when a region is being loaded after a failure, it could theoretically serve stale data in read-only mode from the StoreFiles while it's replaying the log into memstores. There are plenty of examples where we may want to time travel (trade off up-to-dateness for read availability), and making staleness an option opens us up to these possibilities.

            stack Michael Stack added a comment -

            C. sounds good with the option per-Get/per-Scan/per-Delete to say ensure up-to-date as of 'now' (What does this option do if a region is replaying WAL edits? Block? Fail?).

            The document is looking healthy. How about posting it to hbase-dev for discussion? The kick-off might be to adopt this document as the dev target going forward, with hbase 0.21 (0.22) gated on adherence?

            tlipcon Todd Lipcon added a comment -

            Thanks for reviving this issue, Stack.

            I thought a bit more about the stale reads thing, and I think the safest bet is this: by default we do not allow stale reads, but in the future we could add a flag on get() calls that explicitly allows it. I think this is more what people expect out of a datastore, and if people want to make the tradeoff they should ask for it. Since we determined above it should be perfectly efficient to be correct, we might as well be correct by default.

            Here's the current state of the gist:

            Here's a first pass at some kind of spec. These aren't meant to be final - just posting for discussion. I anticipate that after we (developers) come to some kind of conclusion here we will want to run this by the user list to see if we're missing use cases, etc.

            Definitions

            For the sake of common vocabulary, we define the following terms:

            ATOMICITY: an operation is atomic if it either completes entirely or not at all
            CONSISTENCY: all actions cause the table to transition from one valid state directly to another (e.g. a row will not disappear during an update, etc.)
            ISOLATION: an operation is isolated if it appears to complete independently of any other concurrent transaction
            DURABILITY: any update that reports "successful" to the client will not be lost
            VISIBILITY: an update is considered visible if any subsequent read will see the update as having been committed

            The terms must and may are used as specified by RFC 2119. In short, the word "must" implies that, if some case exists where the statement is not true, it is a bug. The word "may" implies that, even if the guarantee is provided in a current release, users should not rely on it.

            APIs to consider

            • Read APIs
              • get
              • scan
            • Write APIs
              • put
              • batch put
              • delete
            • Combination (read-modify-write) APIs
              • incrementColumnValue
              • checkAndPut

            Guarantees Provided

            Atomicity

            1. All mutations are atomic within a row. Any put will either wholly succeed or wholly fail.
              1. An operation that returns a "success" code has completely succeeded.
              2. An operation that returns a "failure" code has completely failed.
              3. An operation that times out may have succeeded and may have failed. However, it will not have partially succeeded or failed.
            2. This is true even if the mutation crosses multiple column families within a row (see the sketch after this list).
            3. APIs that mutate several rows will not be atomic across the multiple rows. For example, a multiput that operates on rows 'a','b', and 'c' may return having mutated some but not all of the rows. In such cases, these APIs will return a list of success codes, each of which may be succeeded, failed, or timed out as described above.
            4. The checkAndPut API happens atomically like the typical compareAndSet (CAS) operation found in many hardware architectures.
            5. The order of mutations is seen to happen in a well-defined order for each row, with no interleaving. For example, if one writer issues the mutation "a=1,b=1,c=1" and another writer issues the mutation "a=2,b=2,c=2", the row must either be "a=1,b=1,c=1" or "a=2,b=2,c=2" and must not be something like "a=1,b=2,c=1".
              1. Please note that this is not true across rows for multirow batch mutations.
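
            A sketch of items 2-4 against the client API (roughly the current HTable interface; conf and the usual hbase client, util.Bytes, and java.util imports are assumed):

                HTable table = new HTable(conf, "mytable");

                // One row, two column families: applied atomically -- readers see both cells or neither.
                Put p = new Put(Bytes.toBytes("row1"));
                p.add(Bytes.toBytes("cf1"), Bytes.toBytes("a"), Bytes.toBytes("1"));
                p.add(Bytes.toBytes("cf2"), Bytes.toBytes("b"), Bytes.toBytes("1"));
                table.put(p);

                // Three rows in one batch: NOT atomic across rows -- some may succeed while others fail.
                List<Put> batch = new ArrayList<Put>();
                for (String row : new String[] { "a", "b", "c" }) {
                  Put bp = new Put(Bytes.toBytes(row));
                  bp.add(Bytes.toBytes("cf1"), Bytes.toBytes("x"), Bytes.toBytes("1"));
                  batch.add(bp);
                }
                table.put(batch);

                // checkAndPut: a compare-and-set within one row, atomic with respect to other writers.
                Put cas = new Put(Bytes.toBytes("row1"));
                cas.add(Bytes.toBytes("cf1"), Bytes.toBytes("a"), Bytes.toBytes("2"));
                boolean applied = table.checkAndPut(Bytes.toBytes("row1"),
                    Bytes.toBytes("cf1"), Bytes.toBytes("a"), Bytes.toBytes("1"), cas);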

            Consistency and Isolation

            1. All rows returned via any access API will consist of a complete row that existed at some point in the table's history.
            2. This is true across column families - i.e. a get of a full row that occurs concurrent with some mutations 1,2,3,4,5 will return a complete row that existed at some point in time between mutation i and i+1 for some i between 1 and 5.

            Consistency of Scans

            A scan is not a consistent view of a table. Scans do not exhibit snapshot isolation.

            Rather, scans have the following properties:

            1. Any row returned by the scan will be a consistent view (i.e. that version of the complete row existed at some point in time)
            2. A scan will always reflect a view of the data at least as new as the beginning of the scan. This satisfies the visibility guarantees enumerated below.
              1. For example, if client A writes data X and then communicates via a side channel to client B, any scans started by client B will contain data at least as new as X.
              2. A scan must reflect all mutations committed prior to the construction of the scanner, and may reflect some mutations committed subsequent to the construction of the scanner.
              3. Scans must include all data written prior to the scan (except in the case where data is subsequently mutated, in which case it may reflect the mutation)

            Those familiar with relational databases will recognize this isolation level as "read committed".

            Please note that the guarantees listed above regarding scanner consistency are referring to "transaction commit time", not the "timestamp" field of each cell. That is to say, a scanner started at time t may see edits with a timestamp value less than t, if those edits were committed with a "backdated" timestamp after the scanner was constructed.
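
            An illustration of the scanner behaviour described above (same client API assumptions as in the earlier sketch):

                ResultScanner scanner = table.getScanner(new Scan());

                // A put that commits while the scan is in progress...
                Put late = new Put(Bytes.toBytes("zzz-late-row"));
                late.add(Bytes.toBytes("cf1"), Bytes.toBytes("a"), Bytes.toBytes("1"));
                table.put(late);

                // ...may or may not be returned by this scanner (no snapshot isolation), but every
                // Result is a complete row that existed at some instant, and nothing committed
                // before getScanner() was called may be missing.
                for (Result row : scanner) {
                  // process row
                }
                scanner.close();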

            Visibility

            1. When a client receives a "success" response for any mutation, that mutation is immediately visible to both that client and any client with whom it later communicates through side channels.
            2. A row must never exhibit so-called "time-travel" properties. That is to say, if a series of mutations moves a row sequentially through a series of states, any sequence of concurrent reads will return a subsequence of those states.
              1. For example, if a row's cells are mutated using the "incrementColumnValue" API, a client must never see the value of any cell decrease (see the sketch after this list).
              2. This is true regardless of which read API is used to read back the mutation.
            3. Any version of a cell that has been returned to a read operation is guaranteed to be durably stored.
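
            A concrete instance of item 2, using incrementColumnValue (sketch; table and the usual client imports assumed):

                byte[] row = Bytes.toBytes("hits");
                byte[] cf = Bytes.toBytes("cf1");
                byte[] col = Bytes.toBytes("count");

                long a = table.incrementColumnValue(row, cf, col, 1);
                long b = table.incrementColumnValue(row, cf, col, 1);    // b > a
                // Any reader, through any read API, may observe the counter at a, at b, or later,
                // but must never observe a value lower than one it has already seen.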

            Durability

            1. All visible data is also durable data. That is to say, a read will never return data that has not been made durable on disk[1]
            2. Any operation that returns a "success" code (e.g. does not throw an exception) will be made durable.
            3. Any operation that returns a "failure" code will not be made durable (subject to the Atomicity guarantees above)
            4. All reasonable failure scenarios will not affect any of the guarantees of this document.

            Tunability

            All of the above guarantees must be possible within HBase. For users who would like to trade off some guarantees for performance, HBase may offer several tuning options. For example:

            • Visibility may be tuned on a per-read basis to allow stale reads or time travel.
            • Durability may be tuned to only flush data to disk on a periodic basis (see the sketch below)
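
            A sketch of what such knobs could look like (the exact method names here are illustrative and may differ from, or not yet exist in, a given release):

                // Per-table: flush the log periodically rather than on every mutation (weaker durability).
                HTableDescriptor desc = new HTableDescriptor("metrics");
                desc.setDeferredLogFlush(true);

                // Per-mutation: skip the WAL entirely for data the application can afford to lose.
                Put p = new Put(Bytes.toBytes("row1"));
                p.setWriteToWAL(false);
                p.add(Bytes.toBytes("cf1"), Bytes.toBytes("a"), Bytes.toBytes("1"));
                table.put(p);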

            Notes:

            [1] In the context of HBase, "durably on disk" implies an hflush() call on the transaction log. This does not actually imply an fsync() to magnetic media, but rather just that the data has been written to the OS cache on all replicas of the log. In the case of a full datacenter power loss, it is possible that the edits are not truly durable.


            apurtell Andrew Kyle Purtell added a comment -

            by default we do not allow stale reads, but in the future we could add a flag on get() calls that explicitly allows it

            If we accept a degraded mode of operation (HBASE-2183) while HDFS is unavailable or a region is taking on IOEs (switch to read only cascades from region to table), then there will be circumstances where the guarantees normally made by the system will not apply. I would expect the initial strategy to be something like:

            • Stop accepting writes
            • Serve data available in cache, set flag or field in response to indicate degraded operation
            • Serve data available in disk stores that can still be accessed, whichever are not throwing IOEs.

            This is relevant to this issue in the sense that committed data may temporarily "disappear".

            tlipcon Todd Lipcon added a comment -

            Yep, I agree for sure that we want degraded operation to be a possibility. I was thinking, though, that the client should specify "I'm OK with stale data" rather than having to specify "I'm not OK with stale data". Perhaps if you've set "I'm not OK with stale data" (my proposed default) and the system is degraded, it would throw an exception saying "DegradedModeException" so users know they could try again with that flag?

            I'm just a bit nervous about people thinking they always get up-to-date data, and then suddenly they see stale reads from the system without having to specifically OK it.
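
            The client-side contract being proposed would look roughly like this (DegradedModeException and the stale-read flag are hypothetical at this point; table is an open HTable):

                Get g = new Get(Bytes.toBytes("row1"));
                Result r;
                try {
                  r = table.get(g);                 // default: fail rather than silently serve stale data
                } catch (DegradedModeException e) {
                  g.setStaleReadsOk(true);          // hypothetical flag: explicitly accept stale data
                  r = table.get(g);
                }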


            apurtell Andrew Kyle Purtell added a comment -

            DegradedModeException

            +1

            streamy Jonathan Gray added a comment -

            Awesome stuff, Todd.

            According to these guarantees, we would not be able to make regions of a crashed server available until after their log replay completes, correct?

            tlipcon Todd Lipcon added a comment -

            According to these guarantees, we would not be able to make regions of a crashed server available until after their log replay completes, correct?

            Yea, I think we'd want to make the region available in "degraded/stale mode" - then clients who specifically set the "stale data OK" flag would see it just fine. Standard clients that expect new data would get the special exception.

            streamy Jonathan Gray added a comment -

            Right, got it. Sounds good.

            stack Michael Stack added a comment -

            According to these guarantees, we would not be able to make regions of a crashed server available until after their log replay completes, correct?

            ...but no reason we couldn't already be up and taking writes in the meantime, before the log replay completes?

            tlipcon Todd Lipcon added a comment -

            ...but no reason we couldn't already be up and taking writes in the meantime, before the log replay completes?

            Interesting... I suppose that's true, but only writes that don't require reads first (eg a delete might be problematic because we need to know the newest timestamp prior to it, right?)

            streamy Jonathan Gray added a comment -

            Interesting... I suppose that's true, but only writes that don't require reads first (eg a delete might be problematic because we need to know the newest timestamp prior to it, right?)

            That's right, so it would be kind of weird. But the thing is, a majority of use cases are not using deleteLatest or checkAndPut or incrementCV, so it would be unfortunate to block all writes for those limited cases.

            Maybe writes that require reads could throw the same DegradedModeException but simple writes would go through in this mode.


            apurtell Andrew Kyle Purtell added a comment -

            Maybe writes that require reads could throw the same DegradedModeException but simple writes would go through in this mode.

            +1

            tlipcon Todd Lipcon added a comment -

            Yep, I think that makes sense. Do you think anything in this document precludes that? Basically what we're discussing here is cleverness that allows us to be more useful while maintaining the same guarantees. So long as our "spec" (the output of this JIRA) talks about user-visible properties, we can feel free to be as clever as we want underneath.

            stack Michael Stack added a comment -

            Some formatting mods (Minor).

            stack Michael Stack added a comment -

            Minor aligning of the Tunability heading to match that of Consistency and Visibility.

            stack Michael Stack added a comment -

            I read the spec with the onlining of regions to take writes but not reads in mind and I do not see a violation. I did wonder, though, if we should have a section on what happens if you try to read and the system can't give you known-correct data - e.g. the exception thrown if you try to read an onlined region that is replaying edits - but thought this an implementation detail that didn't belong herein.

            tlipcon Todd Lipcon added a comment -

            Here's a patch that converts the document to Forrest style and includes it in the docs.

            stack Michael Stack added a comment -

            Changed my mind. This is patch-available and it's a blocker. Bringing back into 0.20.4.

            stack Michael Stack added a comment -

            Changed my mind again. This doc should go into the release that includes a sync; otherwise we're nowhere near what this doc describes (What do you think, Todd?).

            tlipcon Todd Lipcon added a comment -

            Yep, that makes sense - we should add the unit tests that 'prove it', and then commit this with the tests once they pass

            stack Michael Stack added a comment -

            Applied to branch and trunk. Applied as an incompatible change. Any issue found where we do not align w/ the spec is a bad, bad bug and needs fixing. In trunk, the site is broken so you can't really see this new doc readily, but in trunk you can't see any of the old src doc either. It needs moving over. In trunk I verified that it's showing up in the left navbar on the site and that it looks good.

            Thanks for driving this through Todd.

            stack Michael Stack added a comment -

            Marking these as fixed against 0.21.0 rather than against 0.20.5.

            larsfrancke Lars Francke added a comment -

            This issue was closed as part of a bulk closing operation on 2015-11-20. All issues that have been resolved and where all fixVersions have been released have been closed (following discussions on the mailing list).


            People

              Assignee: tlipcon Todd Lipcon
              Reporter: tlipcon Todd Lipcon