Kudu / KUDU-1767

Reordering of client operations from the same KuduSession is possible

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: client, tablet
    • Labels: None

    Description

      It is possible for client operations written via the same KuduSession to be reordered on the server side in MANUAL_FLUSH and AUTO_BACKGROUND_FLUSH modes. This violates our desired consistency guarantees.

      This may occur because we allow concurrent flushes from the client for throughput reasons and there is nothing enforcing the well-ordering of lock acquisition from a single client session on the server side.
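
      A minimal sketch of how this can surface, assuming the Java client, a hypothetical table named 'metrics' with integer columns 'rowid' and 'value', and the background-flush mode (SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND in the Java API):

        import org.apache.kudu.client.KuduClient;
        import org.apache.kudu.client.KuduException;
        import org.apache.kudu.client.KuduSession;
        import org.apache.kudu.client.KuduTable;
        import org.apache.kudu.client.SessionConfiguration;
        import org.apache.kudu.client.Update;

        public class ReorderSketch {
          public static void main(String[] args) throws KuduException {
            // Hypothetical master address and table name.
            KuduClient client = new KuduClient.KuduClientBuilder("master-host:7051").build();
            KuduTable table = client.openTable("metrics");
            KuduSession session = client.newSession();
            session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);

            // Two updates to the same row, applied in client order.
            Update first = table.newUpdate();
            first.getRow().addInt("rowid", 5);
            first.getRow().addInt("value", 1);
            session.apply(first);

            Update second = table.newUpdate();
            second.getRow().addInt("rowid", 5);
            second.getRow().addInt("value", 2);
            session.apply(second);

            // If the two updates land in separate background-flushed batches,
            // both batches can be in flight concurrently and the server may
            // apply them out of order, leaving value=1 even though every
            // operation returned OK.
            session.flush();
            client.shutdown();
          }
        }

      The same race applies to MANUAL_FLUSH whenever more than one flushed batch is allowed in flight at once.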

    Activity

            danburkert Dan Burkert added a comment -

            Is this consistency guarantee documented? I was not under the impression that the C++ or Java clients guaranteed this. It really hurts parallelism to do so (only one batch can be in flight to a tablet server at once). Additionally, it's not that strong a guarantee when you consider that rows falling in separate tablets absolutely do not have this guarantee.

            mpercy Mike Percy added a comment -

             I think most people would expect that the order in which they put their updates into the same client is the order in which they are applied.

            For example, if you do this:

            update rowid=5 value=1
            update rowid=5 value=2
            

             And then you flush, it returns OK, but rowid=5 actually contains value=1 after quiescing. At that point it appears that you have lost data and that the operations were misordered.

            mpercy Mike Percy added a comment -

            Pulling in some comments from KUDU-1761.

            From dralves:

            You're suggesting that EO semantics should enforce write order but my point was: how is the server supposed to know that the client requires order enforcement versus just trying to do multiple writes at the same time?

            There's a "happened before" relationship here that we're expecting to be enforced and that I think should be enforced in the client, outside of EO semantics.

            From tlipcon:

            I agree that the exactly-once stuff should have no bearing on request ordering; it's only meant to provide idempotency, not any other cross-request guarantees, and I think it would be a mistake to try to bend it into providing some kind of serialization order (for the same reason David mentioned).

            My vote is that we commit a change to the client API docs that makes this behavior more clear – probably on the 'Flush' API and on the docs for the different flush modes. It's a slightly surprising bit of semantics, so worth noting, but I don't think it's worth prioritizing a fix at this point in time.

            tlipcon Todd Lipcon added a comment -

            mpercy - mind doing the commit to the API docs suggested above? Would be good to make the semantics as implemented clear, and then we can figure out how to make them less surprising in some future release.

            tlipcon Todd Lipcon added a comment -

            Ping mpercy

            mpercy Mike Percy added a comment -

            tlipcon, I'll doc this, it's on my list for this week.

             mpercy Mike Percy added a comment - edited

            After discussing this issue with a couple of folks, I am marking this issue Won't Fix for the following reasons:

            1. This behavior does not violate strict serializability because we are talking about operations that occur simultaneously from Kudu's perspective.
            2. There are workarounds for the observed behavior.

            Systems that want to write in a particular order can choose to flush one batch at a time to avoid reordering. If single-key ordering conflicts are the primary concern, care can also be taken by client programs to flush an update to a single key before apply()ing a following update to that key while still maintaining some parallelism at load time.
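
             A minimal sketch of that second workaround in the Java client, reusing the hypothetical 'rowid'/'value' columns from the example above: flushing the first update before applying the conflicting one keeps the server from ever seeing them out of order.

             import org.apache.kudu.client.KuduException;
             import org.apache.kudu.client.KuduSession;
             import org.apache.kudu.client.KuduTable;
             import org.apache.kudu.client.SessionConfiguration;
             import org.apache.kudu.client.Update;

             public class OrderedUpdateSketch {
               // Apply two conflicting updates to the same key in a guaranteed
               // order by flushing the first batch before applying the second.
               static void updateInOrder(KuduSession session, KuduTable table) throws KuduException {
                 session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);

                 Update first = table.newUpdate();
                 first.getRow().addInt("rowid", 5);
                 first.getRow().addInt("value", 1);
                 session.apply(first);
                 session.flush();  // wait until the first update is acknowledged

                 Update second = table.newUpdate();
                 second.getRow().addInt("rowid", 5);
                 second.getRow().addInt("value", 2);
                 session.apply(second);
                 session.flush();  // this batch cannot race ahead of the first
               }
             }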

            mpercy Mike Percy added a comment -

            I filed KUDU-1841 as a potential way to make dealing with this issue friendlier.

            danburkert Dan Burkert added a comment -

            I'm reopening this, because I think it is a useful feature to have, and I think we can fix it without impacting performance. I'm not keen on introducing more write modes, because I think even the set we have is unnecessarily confusing, and really only exists because we gradually changed the semantics of the clients instead of designing the semantics up-front. We can strengthen the guarantees without breaking backwards compat, so we don't have to introduce a new mode for this case.

             Fixing this would require the result tracker to reject out-of-order write batches from clients. Then, clients are free to have an arbitrary number of in-flight batches; however, they should be encouraged to rate limit when memory-pressure errors are received.
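
             A hypothetical sketch of that server-side check; the class and its names are invented for illustration and are not Kudu's actual ResultTracker. The server remembers the last batch sequence number accepted from each client session and rejects anything that does not directly follow it.

             import java.util.HashMap;
             import java.util.Map;

             // Illustration only: not Kudu's ResultTracker.
             public class SessionOrderingCheck {
               private final Map<String, Long> lastSeqnoBySession = new HashMap<>();

               // Returns true if the batch may be applied; false means the client
               // sent it out of order and must retry once earlier batches land.
               public synchronized boolean acceptBatch(String sessionId, long seqno) {
                 long last = lastSeqnoBySession.getOrDefault(sessionId, -1L);
                 if (seqno != last + 1) {
                   return false;
                 }
                 lastSeqnoBySession.put(sessionId, seqno);
                 return true;
               }
             }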

            tlipcon Todd Lipcon added a comment -

            I don't think simply rejecting out-of-order writes based on sequence number is doable.

            First of all, this would be too strict – two operations bound for different tablets on the same server would end up with a false ordering and thus cause them to be fully serialized rather than parallel. We depend on parallelism across tablets even if they're on the same server.

            Second, I think there would be issues with leader changes. Imagine something like:

            • Client sends seqno 1 to TS A
             • TS A crashes and so doesn't respond; the client marks TS A as "down" in its metacache
            • Client sends seqno 2 to TS B
            • TS B receives seqno 2 but hasn't yet seen seqno 1 as a replica
            • TS B finishes processing the UpdateConsensus RPC from the now-dead TS A and now sees seqno 1 "out of order"

            It can't reject seqno 1 at this point because it's already been replicated.

            I'm sure it's possible to get stricter semantics, but I think in most cases even if we provided this, clients would end up producing their own variant of this issue. For example, doing a bulk load using Spark or Impala means that many tasks are writing in parallel, and it's not deterministic who "wins" a race when the same row shows up twice.

            danburkert Dan Burkert added a comment -

            Good point. Could we add a 'depends-on' field to the write batch with the ID of the previous batch?


            People

              Assignee: Unassigned
              Reporter: Mike Percy (mpercy)
              Votes: 0
              Watchers: 6
