Kudu / KUDU-1767

Reordering of client operations from the same KuduSession is possible

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: client, tablet
    • Labels: None

    Description

      It is possible for client operations written via the same KuduSession to be reordered on the server side in MANUAL_FLUSH and AUTO_BACKGROUND_FLUSH modes. This violates our desired consistency guarantees.

      This may occur because we allow concurrent flushes from the client for throughput reasons and there is nothing enforcing the well-ordering of lock acquisition from a single client session on the server side.
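
      A minimal sketch of how this can surface, assuming the Java client, a hypothetical table named 'metrics' with integer columns 'rowid' and 'value', and the background-flush mode (SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND in the Java API):

        import org.apache.kudu.client.KuduClient;
        import org.apache.kudu.client.KuduException;
        import org.apache.kudu.client.KuduSession;
        import org.apache.kudu.client.KuduTable;
        import org.apache.kudu.client.SessionConfiguration;
        import org.apache.kudu.client.Update;

        public class ReorderSketch {
          public static void main(String[] args) throws KuduException {
            // Hypothetical master address and table name.
            KuduClient client = new KuduClient.KuduClientBuilder("master-host:7051").build();
            KuduTable table = client.openTable("metrics");
            KuduSession session = client.newSession();
            session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);

            // Two updates to the same row, applied in client order.
            Update first = table.newUpdate();
            first.getRow().addInt("rowid", 5);
            first.getRow().addInt("value", 1);
            session.apply(first);

            Update second = table.newUpdate();
            second.getRow().addInt("rowid", 5);
            second.getRow().addInt("value", 2);
            session.apply(second);

            // If the two updates land in separate background-flushed batches,
            // both batches can be in flight concurrently and the server may
            // apply them out of order, leaving value=1 even though every
            // operation returned OK.
            session.flush();
            client.shutdown();
          }
        }

      The same race applies to MANUAL_FLUSH whenever more than one flushed batch is allowed in flight at once.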

    Activity

            danburkert Dan Burkert added a comment -

            Is this consistency guarantee documented? I was not under the impression that the C++ or Java clients guaranteed this. It really hurts parallelism to do so (only one batch can be in flight to a tablet server at once). Additionally, it's not that strong a guarantee when you consider that rows falling in separate tablets absolutely do not have this guarantee.

            mpercy Mike Percy added a comment -

             I think most people would expect that the order in which they put their updates into the same client is the order in which they are applied.

            For example, if you do this:

            update rowid=5 value=1
            update rowid=5 value=2
            

             And then you flush, it returns OK, but rowid=5 actually contains value=1 after quiescing. At that point it appears that you have lost data and that the operations were misordered.

            mpercy Mike Percy added a comment -

            Pulling in some comments from KUDU-1761.

            From dralves:

            You're suggesting that EO semantics should enforce write order but my point was: how is the server supposed to know that the client requires order enforcement versus just trying to do multiple writes at the same time?

            There's a "happened before" relationship here that we're expecting to be enforced and that I think should be enforced in the client, outside of EO semantics.

            From tlipcon:

            I agree that the exactly-once stuff should have no bearing on request ordering; it's only meant to provide idempotency, not any other cross-request guarantees, and I think it would be a mistake to try to bend it into providing some kind of serialization order (for the same reason David mentioned).

            My vote is that we commit a change to the client API docs that makes this behavior more clear – probably on the 'Flush' API and on the docs for the different flush modes. It's a slightly surprising bit of semantics, so worth noting, but I don't think it's worth prioritizing a fix at this point in time.

            tlipcon Todd Lipcon added a comment -

            mpercy - mind doing the commit to the API docs suggested above? Would be good to make the semantics as implemented clear, and then we can figure out how to make them less surprising in some future release.

            tlipcon Todd Lipcon added a comment -

            Ping mpercy

            mpercy Mike Percy added a comment -

            tlipcon, I'll doc this, it's on my list for this week.

             mpercy Mike Percy added a comment - edited

            After discussing this issue with a couple of folks, I am marking this issue Won't Fix for the following reasons:

            1. This behavior does not violate strict serializability because we are talking about operations that occur simultaneously from Kudu's perspective.
            2. There are workarounds for the observed behavior.

            Systems that want to write in a particular order can choose to flush one batch at a time to avoid reordering. If single-key ordering conflicts are the primary concern, care can also be taken by client programs to flush an update to a single key before apply()ing a following update to that key while still maintaining some parallelism at load time.
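
             A minimal sketch of that second workaround in the Java client, reusing the hypothetical 'rowid'/'value' columns from the example above: flushing the first update before applying the conflicting one keeps the server from ever seeing them out of order.

             import org.apache.kudu.client.KuduException;
             import org.apache.kudu.client.KuduSession;
             import org.apache.kudu.client.KuduTable;
             import org.apache.kudu.client.SessionConfiguration;
             import org.apache.kudu.client.Update;

             public class OrderedUpdateSketch {
               // Apply two conflicting updates to the same key in a guaranteed
               // order by flushing the first batch before applying the second.
               static void updateInOrder(KuduSession session, KuduTable table) throws KuduException {
                 session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);

                 Update first = table.newUpdate();
                 first.getRow().addInt("rowid", 5);
                 first.getRow().addInt("value", 1);
                 session.apply(first);
                 session.flush();  // wait until the first update is acknowledged

                 Update second = table.newUpdate();
                 second.getRow().addInt("rowid", 5);
                 second.getRow().addInt("value", 2);
                 session.apply(second);
                 session.flush();  // this batch cannot race ahead of the first
               }
             }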

            mpercy Mike Percy added a comment -

            I filed KUDU-1841 as a potential way to make dealing with this issue friendlier.

            danburkert Dan Burkert added a comment -

            I'm reopening this, because I think it is a useful feature to have, and I think we can fix it without impacting performance. I'm not keen on introducing more write modes, because I think even the set we have is unnecessarily confusing, and really only exists because we gradually changed the semantics of the clients instead of designing the semantics up-front. We can strengthen the guarantees without breaking backwards compat, so we don't have to introduce a new mode for this case.

             Fixing this would require the result tracker to reject out-of-order write batches from clients. Then, clients are free to have an arbitrary number of in-flight batches; however, they should be encouraged to rate limit when memory-pressure errors are received.
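
             A hypothetical sketch of that server-side check; the class and its names are invented for illustration and are not Kudu's actual ResultTracker. The server remembers the last batch sequence number accepted from each client session and rejects anything that does not directly follow it.

             import java.util.HashMap;
             import java.util.Map;

             // Illustration only: not Kudu's ResultTracker.
             public class SessionOrderingCheck {
               private final Map<String, Long> lastSeqnoBySession = new HashMap<>();

               // Returns true if the batch may be applied; false means the client
               // sent it out of order and must retry once earlier batches land.
               public synchronized boolean acceptBatch(String sessionId, long seqno) {
                 long last = lastSeqnoBySession.getOrDefault(sessionId, -1L);
                 if (seqno != last + 1) {
                   return false;
                 }
                 lastSeqnoBySession.put(sessionId, seqno);
                 return true;
               }
             }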

            tlipcon Todd Lipcon added a comment -

            I don't think simply rejecting out-of-order writes based on sequence number is doable.

            First of all, this would be too strict – two operations bound for different tablets on the same server would end up with a false ordering and thus cause them to be fully serialized rather than parallel. We depend on parallelism across tablets even if they're on the same server.

            Second, I think there would be issues with leader changes. Imagine something like:

            • Client sends seqno 1 to TS A
             • TS A crashes and so doesn't respond; the client marks TS A as "down" in its metacache
            • Client sends seqno 2 to TS B
            • TS B receives seqno 2 but hasn't yet seen seqno 1 as a replica
            • TS B finishes processing the UpdateConsensus RPC from the now-dead TS A and now sees seqno 1 "out of order"

            It can't reject seqno 1 at this point because it's already been replicated.

            I'm sure it's possible to get stricter semantics, but I think in most cases even if we provided this, clients would end up producing their own variant of this issue. For example, doing a bulk load using Spark or Impala means that many tasks are writing in parallel, and it's not deterministic who "wins" a race when the same row shows up twice.

            danburkert Dan Burkert added a comment -

            Good point. Could we add a 'depends-on' field to the write batch with the ID of the previous batch?


            People

              Assignee: Unassigned
              Reporter: Mike Percy (mpercy)
              Votes: 0
              Watchers: 6
