Cassandra / CASSANDRA-6106

Provide timestamp with true microsecond resolution

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Fix Version/s: 2.2.0 beta 1
    • Component/s: Core
    • Labels:
    • Environment: DSE Cassandra 3.1, but also HEAD

      Description

      I noticed that this blog post: http://aphyr.com/posts/294-call-me-maybe-cassandra mentions issues with millisecond rounding in timestamps, and I was able to reproduce the issue. If I specify a timestamp in a mutating query, I get microsecond precision, but if I don't, I get timestamps rounded to the nearest millisecond, at least for my first query on a given connection, which substantially increases the possibility of collisions.

      I believe I found the offending code, though I am by no means sure I caught every instance. I think we probably need a fairly comprehensive replacement of all uses of System.currentTimeMillis() with System.nanoTime().

      1. microtimstamp_random_rev2.patch
        2 kB
        Christopher Smith
      2. microtimstamp_random.patch
        2 kB
        Christopher Smith
      3. microtimstamp.patch
        2 kB
        Christopher Smith


          Activity

          Jonathan Ellis added a comment -

          Unfortunately it's not that easy to get a high-resolution timestamp in Java. Here's what the nanoTime javadoc says:

          This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time. The value returned represents nanoseconds since some fixed but arbitrary origin time (perhaps in the future, so values may be negative). The same origin is used by all invocations of this method in an instance of a Java virtual machine; other virtual machine instances are likely to use a different origin...

          The values returned by this method become meaningful only when the difference between two such values, obtained within the same instance of a Java virtual machine, is computed.

          It is NOT time since any particular epoch and you WILL have problems if you treat it like that. (We've actually had a bug where that slipped past code review: CASSANDRA-4432.)

          http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#nanoTime()

          Christopher Smith added a comment -

          Here's a proposed patch to HEAD that fixes at least the two cases I found immediately. It uses currentTimeMillis() and nanoTime() to get a precise but calibrated timestamp.
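
          A minimal sketch of the calibration approach being described (illustrative only; the class and method names here are not from the attached patch):

            public final class MicrosClock
            {
                // Pair a wall-clock sample with a nanoTime sample once, then derive
                // micros-since-epoch from nanoTime deltas against that base.
                private static final long BASE_MICROS = System.currentTimeMillis() * 1000;
                private static final long BASE_NANOS = System.nanoTime();

                public static long currentTimeMicros()
                {
                    return BASE_MICROS + (System.nanoTime() - BASE_NANOS) / 1000;
                }
            }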

          Jonathan Ellis added a comment -

          I'm not sure what this buys us – it basically starts the micros component at a random point in time.

          Rick Branson added a comment -

          Could this re-sample the base wallclock at an interval so that it doesn't drift too far? (say, every 1s or 100ms)

          Jonathan Ellis added a comment -

          Maybe if you spin millis until it rolls to the next value, and grab nanos then?

          Still seems kind of iffy to me
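
          A rough sketch of the spin-until-rollover idea, purely for illustration (not committed code):

            // Busy-wait until currentTimeMillis() ticks over, then pair that millisecond
            // boundary with nanoTime(); the pair approximates the start of a millisecond.
            static long[] calibrateAtMillisBoundary()
            {
                long start = System.currentTimeMillis();
                long millis;
                while ((millis = System.currentTimeMillis()) == start)
                {
                    // spin until the next millisecond tick
                }
                return new long[] { millis, System.nanoTime() };
            }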

          Rick Branson added a comment -

          I think you'd actually have to use the discrepancy between currentTimeMillis() now-start delta vs nanoTime() now-start delta to detect wall-clock drift/change and correct the nanoTime().
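
          Roughly, the drift check being suggested (sketch only; baseMillis, baseNanos, TOLERANCE_MILLIS and recalibrate() are hypothetical names):

            // If wall-clock elapsed time and nanoTime elapsed time diverge past some
            // tolerance, the wall clock was adjusted (ntp step/drift), so recalibrate.
            long wallDeltaMillis = System.currentTimeMillis() - baseMillis;
            long nanoDeltaMillis = (System.nanoTime() - baseNanos) / 1000000;
            if (Math.abs(wallDeltaMillis - nanoDeltaMillis) > TOLERANCE_MILLIS)
                recalibrate(); // re-sample baseMillis/baseNanos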

          Jonathan Ellis added a comment -

          I'm saying that even before you start talking about drift, we're correlating millis/nanos and saying, This Is The Reference Point For Zero Micros, but we could actually be 100 micros into the current ms, we could be 900, we have no way to tell.

          Rick Branson added a comment -

          Yeah you're right. The thread will probably occasionally get interrupted by the kernel between the currentTimeMillis() and the nanoTime() call. It'd probably take a bunch of iterations to properly "calibrate" the base clock. This is getting pretty far into the extremely-fragile territory.

          Christopher Smith added a comment -

          Jonathan:
          If you look at my patch, it calibrates against currentTimeMillis() so that you do get about as proper a "microseconds since the epoch" as is possible. The one thing you could do to improve it would be to periodically recalibrate with currentTimeMillis, but I'd argue that is actually a bad thing, as that would introduce the possibility of timestamps that go back in time.

          What this does is dramatically reduce the probability that two concurrent writes sent to two different nodes will collide (and therefore Cassandra violates its atomicity "guarantee").

          Christopher Smith added a comment -

          Regarding the "could be 100 micros in to the current ms, we could be 900, we have no way to tell".

          That is correct. There is no way, with Java, to have a universal timestamp across nodes that is guaranteed to be ordered properly at a precision less than 2 milliseconds.

          However, given that the wobble in network latency is generally >= 1 millisecond anyway, that is FAR less of a concern. If a client wants to ensure which of two writes happens first, it probably can't use server side timestamps.

          The big concern is the chance that two concurrent writes which have two or more overlapping cells will be assigned the same timestamp. This is a real risk for any app with lots of concurrent writes to the same cells, regardless of its time sensitivity.

          Jonathan Ellis added a comment -

          I dunno. If it's a problem, then don't paper over it ("less likely" is a synonym for "still possible"); use LWT or fix your data model.

          Christopher Smith added a comment - - edited

          Look at the above description and also look at the article. LWT doesn't fix this. You could use a vector clock, but then you have all the hell that comes with that.

          I agree the "still possible" is really dumb and a violation of the guarantees that Cassandra documents. As long as Cassandra has this mechanism though, we should make the probabilities way, way lower. With this change the probability of a collision gets to around the kind of odds as UUID collisions (clarification: the odds are still much higher than type 1 UUID's as they are 128-bit, but at least it is as good as you can do with 64b-bit... note that the mechanism employed is quite similar to how type 1's try to avoid collisions), which I think for practical purposes is "good enough".

          Note that the current "+1" trick also creates potentially backwards ordering problems (if you write 2 times in one millisecond to node A and once in the same millisecond to node B, the second write to node A is treated as having been last, even if it happened 999 microseconds before the write to node B).

          Cassandra should use a different mechanism to resolve concurrent writes with the same timestamp. I would propose something more like this:

          If two nodes have different values for a cell, but have the same timestamp for the cell:

          1) Compute the "token" for the record.
          2) Compute replicas 1 to N for that token and assign ranks 1 to N to the corresponding nodes in the datacenter.
          3) If there is a tie, the win goes to the replica on the node with the highest rank from #2.
          4) If there are two datacenters, each with the same highest-rank node (note this favours data centers with higher replication factors, which seems... good to me), you resolve in favour of the datacenter whose name alphasorts lowest.

          Jonathan Ellis added a comment -

          Don't retroactively change the description a ton, it ruins the context for the subsequent comments. Here's your addendum:

          There seems to be some confusion here, so I'd like to clarify: the purpose of this patch is NOT to improve the precision of ordering guarantees for concurrent writes to cells. The purpose of this patch is to reduce the probability that concurrent writes to cells are deemed as having occurred at the same time, which is when Cassandra violates its atomicity guarantee.

          To clarify the failure scenario. Cassandra promises that writes to the same record are "atomic", so if you do something like:

          create table foo (i int PRIMARY KEY, x int, y int);

          and then send these two queries concurrently:

          insert into foo (i, x, y) values (1, 8, -8);
          insert into foo (i, x, y) values (1, -8, 8);

          you can't be quite sure which of the two writes will be the "last" one, but you do know that if you do:

          select x, y from foo where i = 1;

          you don't know if x is "8" or "-8".
          you don't know if y is "-8" or "8".
          YOU DO KNOW: x + y will equal 0.

          EXCEPT... if the timestamps assigned to the two queries are exactly the same, in which case x + y = 16. Now your writes are not atomic.

          Jonathan Ellis added a comment - - edited

          IMO your proposal [edit: for node based conflict resolution] falls into "cure worse than the disease" category, but I'll throw up the Sylvain Lebresne signal for another opinion.

          Christopher Smith added a comment -

          I am curious about how you think this patch makes things worse than before. Can you describe a scenario?

          Regarding the LWT solution... what would you do to resolve two LWT's with the same timestamps and condition clause, but different updates to the same cells?

          Jonathan Ellis added a comment -

          There's no resolution to be done, because only one will be committed out of the paxos "sandbox."

          Christopher Smith added a comment -

          Ah, you are right about the paxos. I keep forgetting there is no DC-local LWT.

          I can't see someone with multiple data centers avoiding the "race" by having everything be LWT. The performance cost (not to mention the vulnerability to partitioning) seems to go against the point of using Cassandra. It's useful tactically, but not as a strategy.

          Christopher Smith added a comment -

          As for my proposed alternate strategy for resolving timestamp collisions... I'm not sure it is the best approach. It does, however, get Cassandra to the point where you always have atomic record updates. The "highest value wins" strategy strikes me as essentially undermining the case for Cassandra's "cell level" update resolution strategy.

          Sylvain Lebresne added a comment -

          I agree it would be nice to ensure that it's always the cells of the same user update that win on a timestamp tie between 2 user updates.

          I think we all agree however that better timestamp resolution does not solve it, it just makes timestamp ties less likely.

          So I suppose there are 2 "questions":

          1. can we actually fix the real problem?
          2. given that even if we do find an acceptable way to fix it, it's unlikely to be a simple, minor-release kind of change, what about improving the timestamp resolution in the meantime?

          The 2nd question is easier to answer. My opinion is that if we have a simple way to improve the resolution with no particular downside, then why not. But concerning the attached patch, the one thing I'm not totally sure about is that it seems to "freeze" clock drift at JVM startup, even if said clock drift is fixed by ntpd afterwards. Though I'm assuming here that nanoTime is not affected by ntpd adjusting the system clock, which may be a wrong assumption in the first place, I haven't really checked tbh. And of course, in a healthy environment, one would hope that ntpd makes sure no system clock ever drifts enough on any node to make a meaningful difference to the application. Still, committing a patch that doesn't really fix a problem but only makes it less likely is only a no-brainer if there is no downside whatsoever, and I'm just not totally clear on the no-downside part.

          Regarding question 1, what we need (if I'm not mistaken) is a way to break timestamp ties that ensures that, given 2 user updates u1 and u2, either all cells from u1 win or they all lose when resolved against cells of u2.

          For that, one suggestion could be to extend the cell timestamp to include an id of the coordinator of the update. It's enough to have a coordinator id to distinguish timestamp ties, because it is relatively easy to ensure a given coordinator never issues updates with the same timestamp (we ensure that only per client connection so far, but making it global to the coordinator is pretty easy). I "think" that may not be very far from Christopher's suggestion above, though I don't think we need to mess with datacenters and whatnot. Namely, now that we can do CAS operations, it should be easy to assign a unique id per host in the cluster that is short (i.e. we can support 65K nodes with 2 bytes).
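
          Sketching the idea (illustrative only; the field layout and names here are assumptions, not a committed design):

            // A cell timestamp extended with a small coordinator id: ties on 'micros'
            // are broken by the id, so all cells of one update win or lose together.
            final class ExtendedTimestamp implements Comparable<ExtendedTimestamp>
            {
                final long micros;          // microseconds, unique per coordinator
                final short coordinatorId;  // cluster-assigned id (e.g. allocated via CAS)

                ExtendedTimestamp(long micros, short coordinatorId)
                {
                    this.micros = micros;
                    this.coordinatorId = coordinatorId;
                }

                public int compareTo(ExtendedTimestamp other)
                {
                    int cmp = Long.compare(micros, other.micros);
                    return cmp != 0 ? cmp : Short.compare(coordinatorId, other.coordinatorId);
                }
            }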

          Of course it's not a totally trivial change, but I don't think it's really that complex to do either (I'm relatively confident I can get that done for 2.1 for instance if we decide to go for it). At first glance, the only downside I see is that it'll add 2 bytes of overhead per cell. But I'm not really sure this would have much measurable impact in practice (given compression in particular), and there are not-that-complex ways to compact the current timestamp more than we do now if we really want to gain back a few bytes.

          Jonathan Ellis added a comment -

          I like it.

          I'd be inclined to prefer reserving a couple bytes from our existing 64-bit clock, noting that we have several centuries of headroom in micros-since-epoch in 48 bits. Compression will help on disk, but we do have to decompress to do anything useful and I don't want to cause more GC on every op to deal with what is a pretty rare corner case.
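
          Purely to illustrate the bit-packing idea (the exact split and the names are assumptions, not a proposal of record):

            // Keep the high 48 bits for the time component and reserve the low 16 bits
            // for a node id; ordering by time is preserved, and ties sort by node id.
            static long pack(long micros, int nodeId)
            {
                return (micros << 16) | (nodeId & 0xFFFFL);
            }

            static long microsOf(long packed) { return packed >>> 16; }
            static int nodeIdOf(long packed)  { return (int) (packed & 0xFFFF); }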

          Jonathan Ellis added a comment -

          I'm assuming here that nanoTime is not affected by ntpd adjusting the system clock

          I think that's implied by its design.

          Sylvain Lebresne added a comment -

          noting that we have several centuries of headroom in micros-since-epoch in 48 bits

          We can definitely do that, good idea.

          Christopher Smith added a comment -

          I agree there are two approaches to addressing the problem. I had assumed that given the design, it was intended that this problem exist (why even code it up otherwise), but I'm glad if it was not.

          Regarding clock drift: I'm happy to code up a version that recalibrates to currentTimeMillis() every N seconds. I thought about doing it every call, but that would mean calling currentTimeMillis() and nanoTime() on every write, which seems... expensive. Can someone suggest an appropriate interval?

          As for assigning an ID per node and storing it in the cell, I don't think we should do that. First, we don't really have CAS operations. We have CAS operations if you can coordinate with a majority of nodes. That means, in the case of a split, you would not be able to set IDs for nodes in one portion of the split, and possibly all portions. Worse, you might have situations where for certain portions of the ring one part of the split has the majority, while for other portions of the ring the other part of the split has it. In general, if a new node comes on, I don't think we want to require that it have access to a "quorum" of nodes in order to accept writes.

          I can understand the design to avoid storing extra bytes per cell being appealing. We could store a proper type 1 UUID, but even that still has at least theoretical collisions, and of course the space would be prohibitive.

          Storing the ID in extra bits in the timestamp is NOT a good idea. Sure you have a few centuries of headroom, but the CQL protocol exposes this field's value. Not only can clients read back the value and interpret it as microseconds, but they can write a value to it (and probably should be right now if they want to avoid problems related to this bug). If they are writing client generated values (which, if you think about it, needn't be tied to the time at all), they are in for a nasty surprise: two bytes of their values have been masked off.

          I think it is much simpler to exploit the fact that when two nodes are talking, they already have plenty of context for determining who should win. Then you don't have to store anything extra with each cell. If you want to assign a distinct ID for each node, do so but it would seem the existing Cassandra design already needs to have ways of uniquely identifying each active node, so why bother creating a new system?

          Christopher Smith added a comment -

          I should clarify my comment about client assigned time stamps.

          If you don't mask off bytes in the client generated timestamps for storing the node ID, then they aren't comparable with server generated timestamps, which will cause problems.

          The best you could do is store one extra bit to indicate whether it was a client or server generated timestamp and somehow resolve cases of mixed client & server values accordingly. However, I think that's a huge can of worms not worth diving into. It's better to just use all of the existing 64-bits to store the timestamp and store any and all additional information in additional bytes.

          Christopher Smith added a comment -

          Regarding resyncing the microsOffset. The reality is that the only problem is if the clock skew gets bigger than the "adjust" window of NTP or whatever other clock adjusting happens in the cluster (let's call that "sigma"). As a client, based solely on the contracts exposed by Cassandra's documentation, if I want a write to win out over a previous write X, I have a few options:

          1) Use client generated timestamps both for write X and my current write, and ensure by other means my write has a higher timestamp value than all others.
          2) Make sure I do my write more than whatever the cluster's "sigma" is since X.
          3) Ensure all my writes for the cells in question are sent to a specific node (which potentially means cross-data center writes).
          4) After write X, for all columns I want to write to, I do a "select writetime(columnA), writetime(columnB) from ..." with consistency level high enough to exceed replication factor when added to write X's consistency level (for multi-datacenter clusters, that pretty much means both operations are QUORUM, or the select is done with consistency ALL), and then do a write with a client timestamp value that is greater than all of the timestamps I get back.

          Guarantees about ordering of writes are hard to come by, and if you don't use client generated timestamps, you basically should assume that two writes could resolve either way unless they are greater than "sigma" apart, and "sigma" is usually a big enough number that at that point that it is mostly irrelevant.

          The tl;dr is: unless you are explicit about ordering client side, you can't be certain about the ordering of writes. All you can be certain of is atomicity (once this bug is fixed).

          So, to a degree, clock skew isn't terribly important (clients just can't count on how server time stamps will end up resolving the order of writes), but if we're going to adjust it, there probably needs to be a user adjustable value for "sigma" that controls how often the clock gets resync'd.

          Does that sound reasonable?

          Sylvain Lebresne added a comment -

          if a new node comes on, I don't think we want to require that it have access to a "quorum" of nodes in order to accept writes

          Given the fact we can persist the ID, the only time we'd constrain a node is when it is first bootstrapped, before fully joining the ring. Also, nothing forces us to use the total number of nodes for the replication factor of the table that will hold that node->id map (like what we do for system_auth, we might want a special keyspace to control the RF). In any case, I hardly think it's a blocker.

          I think it is much simpler to exploit the fact that when two nodes are talking, they already have plenty of context for determining who should win.

          What do you mean exactly by "when two nodes are talking"? What exact moment of Cassandra inner working are you referring to? Surely you can't be talking of anything related to the write path since 2 updates that conflicts (have the same timestamp) can be in 2 completely separate partitions. So I'm not sure I follow.

          Then you don't have to store anything extra with each cell.

          Given that conflict resolution is done at the cell level and can be done at basically any time after the initial writes, I'm not really sure how that's possible (unless you rewrite Cassandra completely of course, but it seems we agree it's not worth creating a new system).

          The best you could do

          Wanna bet?

          Christopher Smith added a comment -

          Given the fact we can persist the ID, the only time we'd constrain a node is when it is first bootstrapped, before fully joining the ring.

          True.

          What do you mean exactly by "when two nodes are talking"? What exact moment of Cassandra inner working are you referring to? Surely you can't be talking of anything related to the write path since 2 updates that conflicts (have the same timestamp) can be in 2 completely separate partitions. So I'm not sure I follow.

          Good question, and I apologize for not being clear. I was referring to when a cell's data is being replicated from one node to another.

          If the cluster becomes partitioned, there is no need to deal with matching timestamps on two separate writes in the two different partitions until such time as the partition ends and you are repairing (indeed, before then you can't even be aware of the matter). When that time comes, it eventually devolves to two nodes exchanging information on two cells. That is the context where I thought a simple heuristic could ensure writes are treated atomically.

          Wanna bet?

          Almost want to... not because I think I'd win the bet, but because I know all too well that sooner or later I'd lose, and we'd have a better system.

          Christopher Smith added a comment -

          An alternate resolution to the timestamp collision using java.util.Random to fill in the least significant bits of the timestamp.
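
          A minimal sketch of the randomized-lower-bits idea (illustrative; not the attached patch itself):

            import java.util.Random;
            import java.util.concurrent.atomic.AtomicLong;

            // Millisecond wall-clock time promoted to micros, with the sub-millisecond
            // digits filled from a PRNG; a CAS guard keeps results strictly increasing.
            public final class RandomizedMicros
            {
                private static final Random RANDOM = new Random();
                private static final AtomicLong LAST = new AtomicLong();

                public static long next()
                {
                    long candidate = System.currentTimeMillis() * 1000 + RANDOM.nextInt(1000);
                    while (true)
                    {
                        long last = LAST.get();
                        long next = Math.max(candidate, last + 1); // never repeat or go backwards
                        if (LAST.compareAndSet(last, next))
                            return next;
                    }
                }
            }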

          Christopher Smith added a comment - - edited

          Out of curiosity, is there a reason to use volatile for QueryState.clock instead of making it an AtomicLong?

          Christopher Smith added a comment -

          I realized the random patch had a minor bug if time went backwards. A one-character fix is represented in rev2 of the patch provided.

          Christopher Smith added a comment -

          The issue behind the rev2 patch also highlights another issue with the "clock skew" concern in general. To a certain degree we have to acknowledge that the existing code already allows incorrect timestamps in the face of clock skew. If a node's time is ahead by N millis from the other nodes, then even after the clock gets corrected, writes to that node will win when they shouldn't until N millis have passed, because "clock" is designed to be increasing only.

          Sylvain Lebresne added a comment -

          Alright, so for the sake of keeping things focused, I've opened CASSANDRA-6123 to actually fix the underlying issue. Let's focus this ticket on deciding whether having better precision for timestamps in the meantime is worth it.

          On the last attached patch, haven't had a proper look yet but will try to do so and follow-up asap.

          Blair Zajac added a comment -

          Since Cassandra ships with JNA, one could use it to wrap gettimeofday() or maybe clock_gettime() for Unix systems and the equivalent for Windows. I did this for an internal Scala server that needed microsecond precision. One question is whether the Windows equivalent is too expensive. Judging from APR's code, getting microsecond time on Windows doesn't look cheap; look at apr_time_now() here:

          https://svn.apache.org/repos/asf/apr/apr/trunk/time/win32/time.c
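
          For reference, a rough sketch of wrapping gettimeofday() with JNA (assumes the JNA 3.5+ API and the usual two-longs struct timeval layout; untested here):

            import java.util.Arrays;
            import java.util.List;
            import com.sun.jna.Library;
            import com.sun.jna.Native;
            import com.sun.jna.NativeLong;
            import com.sun.jna.Pointer;
            import com.sun.jna.Structure;

            public class NativeMicros
            {
                public static class Timeval extends Structure
                {
                    public NativeLong tv_sec;   // seconds since the epoch
                    public NativeLong tv_usec;  // microseconds within the current second

                    protected List<String> getFieldOrder()
                    {
                        return Arrays.asList("tv_sec", "tv_usec");
                    }
                }

                public interface CLibrary extends Library
                {
                    CLibrary INSTANCE = (CLibrary) Native.loadLibrary("c", CLibrary.class);
                    int gettimeofday(Timeval tv, Pointer tz); // tz is passed as null
                }

                public static long currentTimeMicros()
                {
                    Timeval tv = new Timeval();
                    CLibrary.INSTANCE.gettimeofday(tv, null);
                    return tv.tv_sec.longValue() * 1000000L + tv.tv_usec.longValue();
                }
            }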

          Jonathan Ellis added a comment -

          I'd be okay with using gettimeofday when we find it in libc and leaving windows with the status quo.

          Robert Coli added a comment -

          What are the "since" / "affects" versions for this issue?

          Christopher Smith added a comment -

          Hey, since 1.2.11 is going out without this issue resolved, wanted to see what the thinking was. While Cassandra ships with JNA, it isn't strictly a requirement, so I think using nanoTime, perhaps with periodic recalibration would be preferable. Maybe you could go with JNA with the currentTimeMillis() fallback, but that basically will mean JNA has to be there on any setup that needs concurrent update resolution, all because of issues with one platform (Windows).

          What about my alternative of using random data for the lower bits? It could be done the same way TimeUUID works, which would go a long way to making collisions statistically unlikely without imposing any potential Windows slowdown.

          Sylvain Lebresne added a comment -

          since 1.2.11 is going out without this issue resolved

          I'm not against targeting 2.0 but this won't go into the 1.2 branch as this is really just an improvement.

          What about my alternative of using random data for the lower bits?

          I don't think we want to do that because at the very least we want to keep the behavior that timestamp generated for the same client connection are always strictly increasing, and it seems to me that randomizing is not really compatible with it.

          nanoTime, perhaps with periodic recalibration

          It's an option, though I'll admit that it feels a bit like a hack. Not totally opposed I guess, but like Jonathan, I think I'd be fine with using gettimeofday and leaving platforms that don't support it with the status quo.

          Though in the longer run, I'm starting to be convinced that we should slowly move back to client-side timestamps by default (CASSANDRA-6178), so it's unclear to me how much effort is worth putting into this (given that, at the end of the day, this won't ensure timestamp uniqueness anyway and you'd still have to be aware that on a timestamp tie, the resolution is based on the value).

          Christopher Smith added a comment -

          The random patch still ensures that a given server's generated timestamps are strictly increasing.

          Christopher Smith added a comment -

          I think this is worth addressing simply because the collision probabilities are surprisingly high for anyone working off the documentation.

          Jonathan Ellis added a comment -

          Cassandra 2.1 will always have JNA available (CASSANDRA-5872), so I'm still a fan of "use libc."

          I note that this guy says JNR is significantly faster than JNA: https://groups.google.com/d/msg/mechanical-sympathy/lvd78b667TY/huNQGr8elboJ

          Christopher Smith added a comment -

          If we really care about performance for getting the time, on linux we should just create an NIO buffer around the real-time timer and avoid calling out to C to get the time completely.

          Jonathan Ellis added a comment -

          How would you get the timer address?

          Jonathan Ellis added a comment -

          How would you get the timer address?

          Christopher Smith?

          Christopher Smith added a comment - - edited

          Apologies. I dropped the ball and never got back to this. Thanks for pinging me again.

          Obviously, this is non-trivial, or it'd already have been done. However non-trivial != impossible.

          The way to do this would be reading the data directly from the gtod structure (actually vdso_vsyscall_gtod_data: http://lxr.linux.no/linux/arch/x86/include/asm/vgtod.h#L7). There are a bunch of caveats (for one, you'd need to have a fallback strategy if there was no VDSO for the particular architecture you are on), but you could in theory have some JNI code that grabbed the addresses of gtod.wall_time_sec and gtod.wall_time_nsec and then made them visible as direct ByteBuffer's through the NewDirectByteBuffer function. You could then access the memory much like is done in do_realtime (http://lxr.linux.no/linux/arch/x86/vdso/vclock_gettime.c#L173). I imagine you'd need some extra logic in Java space to handle endian issues, and you'd need to make sure you created a memory barrier every time you entered and exited the Java function to read it, but I think that all is doable.

          There might be an even nicer way to do this with JNA, but I've not used it, so I'll leave that to others to figure out.

          Jonathan Ellis added a comment -

          Thanks, Chris.

          I don't think I'm ready to cross the "custom JNI code" rubicon yet. Let's try gettimeofday or clock_gettime like Blair suggested and see if that's acceptable first.

          Christopher Smith added a comment -

          Isn't using gettimeofday or clock_gettime crossing the "custom JNI code" rubicon anyway?

          Jonathan Ellis added a comment -

          Shouldn't we be able to do that with JNA?

          Blair Zajac added a comment -

          Yes, I called gettimeofday() with JNA for a previous employer on a closed source project, otherwise I would copy/paste it here.
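          For reference, a minimal sketch of that kind of JNA call (illustrative only, not Blair's original code; the class and method names are invented, and it assumes JNA 5.x's Native.load):

          import com.sun.jna.Library;
          import com.sun.jna.Native;
          import com.sun.jna.NativeLong;
          import com.sun.jna.Pointer;
          import com.sun.jna.Structure;

          import java.util.Arrays;
          import java.util.List;

          // Illustrative only: calling gettimeofday(2) through JNA to get microsecond wall-clock time.
          public final class GetTimeOfDay
          {
              public static class Timeval extends Structure
              {
                  public NativeLong tv_sec;   // seconds since the epoch
                  public NativeLong tv_usec;  // microseconds within the second

                  @Override
                  protected List<String> getFieldOrder()
                  {
                      return Arrays.asList("tv_sec", "tv_usec");
                  }
              }

              private interface CLib extends Library
              {
                  CLib INSTANCE = Native.load("c", CLib.class);
                  int gettimeofday(Timeval tv, Pointer tz); // tz is obsolete; pass null
              }

              /** Current wall-clock time in microseconds, or -1 if the call fails. */
              public static long currentTimeMicros()
              {
                  Timeval tv = new Timeval();
                  if (CLib.INSTANCE.gettimeofday(tv, null) != 0)
                      return -1;
                  return tv.tv_sec.longValue() * 1_000_000L + tv.tv_usec.longValue();
              }
          }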

          Christopher Smith added a comment -

          So, looking at JNA, it seems like shared memory/NIO buffer logic could be done with it as much as it could be done with JNI. Is it possible what I'm talking about could be done through JNA as much as with JNI?

          Jonathan Ellis added a comment -

          It's the "code that [grabs] the addresses of gtod.wall_time_sec and gtod.wall_time_nsec" that I don't know how to do with JNA.

          Benedict added a comment - edited

          It doesn't look safe to me to simply grab gtod.wall_time_sec anyway, even if we could find its location, as the nanos value gets repaired after reading with another call. We could investigate further, but for the time being I have a reasonably straightforward solution here

          I started by simply calling the rt clock_gettime method through JNA, which unfortunately clocks in at a heavy 7 micros; since nanoTime and currentTimeMillis are < 0.03 micros, this seemed a little unacceptable. So what I've opted to do is periodically (once per second) grab the latest micros time via the best method possible (clock_gettime if available, currentTimeMillis * 1000 otherwise) and use this to reset the offset; however, to ensure we have a smooth transition, I:

          1. Cap the rate of change at 50ms per second
          2. Ensure it never leaps back in time, at least on any given thread (no way to guarantee stronger than this)
          3. Only apply a change if it is at least 1ms out, to avoid noise (we should possibly tighten this to 100 micros, or make it dependent on the resolution of the time library we're using)

          The result is a method that costs around the same as a raw call to System.nanoTime() but gives pretty decent accuracy. Obviously any method that involves using nanos and calculating an offset from a method that takes ~7micros to return is going to have an inherent inaccuracy, but no more than the 7micros direct method call would itself, and the inaccuracy will be consistent given the jitter reduction I'm applying. At startup we also sample the offset 10k times, derive a 90%ile for elapsed time fetching the offset (we ignore future offsets we calculate that take more than twice this period to sample) and average all of those within the 90%ile.
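          To make the shape of that approach concrete, here is a heavily simplified sketch. It is illustrative only, not the branch referenced above: it omits the per-thread monotonicity guarantee, the jitter thresholds and the startup calibration described in this comment, the class and method names are invented, and realTimeMicros stands in for clock_gettime/gettimeofday via JNA:

          import java.util.function.LongSupplier;

          // Heavily simplified sketch: pin a wall-clock offset against System.nanoTime() and
          // refresh it roughly once per second, capping how fast the offset may change so a
          // single bad sample cannot jerk the clock around. Note: the compound update of
          // offsetMicros below is not atomic; real code must also handle concurrent resyncs,
          // monotonicity and startup calibration.
          public final class ApproximateMicrosClock
          {
              private static final long RESYNC_INTERVAL_NANOS = 1_000_000_000L; // ~1 second
              private static final long MAX_ADJUST_MICROS = 50_000L;            // cap: 50ms per resync

              private final LongSupplier realTimeMicros; // e.g. clock_gettime via JNA
              private volatile long offsetMicros;        // wall-clock micros minus nanoTime()/1000
              private volatile long nextResyncNanos;

              public ApproximateMicrosClock(LongSupplier realTimeMicros)
              {
                  this.realTimeMicros = realTimeMicros;
                  this.offsetMicros = realTimeMicros.getAsLong() - System.nanoTime() / 1000;
                  this.nextResyncNanos = System.nanoTime() + RESYNC_INTERVAL_NANOS;
              }

              public long currentTimeMicros()
              {
                  long nanos = System.nanoTime();
                  if (nanos >= nextResyncNanos) // only pay the slow native call about once per second
                  {
                      long freshOffset = realTimeMicros.getAsLong() - nanos / 1000;
                      long change = freshOffset - offsetMicros;
                      // cap the rate of change so one bad sample cannot jerk the clock around
                      change = Math.max(-MAX_ADJUST_MICROS, Math.min(MAX_ADJUST_MICROS, change));
                      offsetMicros += change;
                      nextResyncNanos = nanos + RESYNC_INTERVAL_NANOS;
                  }
                  return nanos / 1000 + offsetMicros;
              }
          }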

          Sylvain Lebresne added a comment -

          I'd like to summarize my understanding of what we're trying to fix here.

          As far as conflict resolution goes, microsecond resolution is imo rather useless. Given the accuracy of ntp, network latencies and whatnot, no application should ever rely on sub-millisecond resolution for conflicts, and any application that relies on fine-grained ordering of updates to a cell should really provide a client-side timestamp. That doesn't mean we can't use microsecond resolution if it's easy, of course, but it does mean that imo the bar on how much complexity this is worth is rather low.

          This was not the original motivation of this ticket, however. The original motivation was to limit the chance of 2 updates A and B getting the exact same timestamp, because when that happens, we could end up with some cells from A and some cells from B. I think we all agreed that the proper fix for that was more complicated and left to CASSANDRA-6123. Yet, as I said earlier, since that fix is much more complicated, I'm fine with lowering the chances of timestamp conflicts in the meantime if that's easy for us (less often broken is somewhat better than more often broken, even if not broken is obviously better). But on this point, Christopher's solution of randomizing the microsecond bits was actually really simple and probably good enough.

          And to be honest, the complexity of Benedict's branch is above what I consider reasonable for the concrete problem at hand. I'm surely not very smart, but it doesn't fit my own definition of straightforward. I'm not saying it's the most complicated thing ever, but it's complicated enough to make me uncomfortable, given that even a simple rounding error on the timestamp could basically destroy user data.

          I'm also not convinced we need that complexity in practice. What about just having a thread call clock_gettime followed by nanoTime every second or so, and then adding the nanos elapsed since the last clock_gettime call to get the current time? It might not give us the most accurate timestamp possible, but it's imo largely good enough for our purpose (and for clocks going back in time, we already handle that in a brute-force kind of way in QueryState, which is again imo good enough).
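          A rough sketch of that simpler shape (again illustrative only, not the 6106_alternative branch itself; the names are invented and realTimeMicros stands in for a clock_gettime call through JNA):

          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.TimeUnit;
          import java.util.function.LongSupplier;

          // Rough sketch: a scheduled task snapshots (realTimeMicros, nanoTime) once per second,
          // and readers extrapolate from the latest snapshot using nanoTime deltas, so the slow
          // native call never happens on a query thread.
          public final class PeriodicallySyncedClock
          {
              private static final class Snapshot
              {
                  final long realMicros;   // wall-clock micros at sync, e.g. from clock_gettime via JNA
                  final long nanosAtSync;  // System.nanoTime() taken at the same moment
                  Snapshot(long realMicros, long nanosAtSync)
                  {
                      this.realMicros = realMicros;
                      this.nanosAtSync = nanosAtSync;
                  }
              }

              private final LongSupplier realTimeMicros;
              private volatile Snapshot snapshot;

              public PeriodicallySyncedClock(LongSupplier realTimeMicros, ScheduledExecutorService scheduler)
              {
                  this.realTimeMicros = realTimeMicros;
                  this.snapshot = new Snapshot(realTimeMicros.getAsLong(), System.nanoTime());
                  scheduler.scheduleAtFixedRate(
                      () -> snapshot = new Snapshot(realTimeMicros.getAsLong(), System.nanoTime()),
                      1, 1, TimeUnit.SECONDS);
              }

              public long currentTimeMicros()
              {
                  Snapshot s = snapshot;
                  return s.realMicros + (System.nanoTime() - s.nanosAtSync) / 1000;
              }
          }

          The trade-off debated in the following comments is the extra scheduled thread versus never paying the ~7-micro native call on a query thread.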

          Benedict added a comment -

          Well, I was only implementing what was asked: microsecond-accurate timestamps. If we just want disambiguation, that's a different matter. But since sub-millisecond network trips are pretty easy within a rack, I'm not sure that randomization is a good thing.

          Benedict added a comment -

          In case it helps at all, I've commented it heavily and simplified the logic quite a bit, by removing the test on "time elapsed to grab the realtime offset", as the effect will be pretty minimal even if it gets a temporarily whack value. It really isn't actually super complicated, but it was a bit ugly to read and non-obvious without comments.

          It's worth noting that having a monotonically increasing time source is probably a good thing in and of itself, which this also provides.

          I've rebased and pushed -f

          Benedict added a comment -

          One last tweak: I wanted to make us just a little tolerant of really whack values caused by GC during a measurement, but I've done this in a way that makes the code neater rather than uglier.

          Sylvain Lebresne added a comment -

          Sorry, but I still find it a tad more complicated than what we really need (I'll note that I really don't want us to screw up timestamps due to some rounding error or whatnot, which is why I rather strongly care about simplicity). I'm probably stupid, but I still need to sit with a pen and paper for a few minutes to understand the arithmetic and check there is no unhandled edge case, and even then I can't totally convince myself that the whole adjusting business really provides us practical benefits (given how timestamps are used) over just brute-forcing monotonicity like we currently do in QueryState in the rare cases clocks go slightly backward. But for the record, I'll say on the patch that:

          • QueryState#getTimestamp would need to be changed or this isn't actually used by user queries.
          • I can't totally reconcile saying that clock_gettime is a bit slow with still having it called on a query thread, even if it's only once per second (which is not even guaranteed, because once the validUntilNanos expires, multiple threads might fight over updating the spec). Especially when, if the call is slow for some reason, we incur an even greater cost by retrying the call up to 3 times. Not a huge problem I suppose, but not too ideal in my book.

          So anyway, I've pushed at https://github.com/pcmanus/cassandra/commits/6106_alternative what I think is a simpler yet sufficient solution, and I'd rather go with that (unless there's an obvious big problem with it I've missed).

          Benedict added a comment -

          Well, if we're not worried about monotonicity, we can certainly simplify the patch - but I would much rather not introduce a new thread for updating the spec; if we're incurring the micros cost on a seconds time horizon, incurring it a few times each second is irrelevant (even a few times per thread in the worst case), it's just a matter of ensuring that we don't incur those micros costs for every query, as it would be a really noticeable percentage of total operation time. We already have far more threads than necessary, let's not introduce one here where it's not necessary (note that 3x loop is only ever going to occur if we overlap with a GC pause, so we're only likely to loop twice at most, and almost never more than once).

          Still, I'd much prefer that we just get comfortable with the math, since if we don't address it now it will go on the backburner and be forgotten. Correctness relies on integer arithmetic always truncating instead of rounding, so there's no floating point weirdness or edge cases to contend with. Behaviour is the same for all value ranges. We always apply a delta to a fixed offset, and the delta is calculated by multiplying by a monotonically increasing value and dividing by a fixed value - as such the monotonically increasing multiplier either pushes the combined value over into the next whole number after division, or it doesn't. Since we are also guaranteed to only ever move by at most 10% of the elapsed time interval, the effect is essentially that (at most) 1 microsecond in 10 simply appears to take 2 microseconds to elapse.

          TL;DR: every x microseconds we deduct an extra microsecond when adjusting backwards, so we can only ever stall time (by 1 microsecond), never jump backwards.

          Either way, since your patch provides no monotonicity guarantees, any slight risk that I'm wrong in the analysis really doesn't seem important - it's still much better than without it, if that's your only concern?

          Sylvain Lebresne added a comment -

          it's still much better than without it, if that's your only concern?

          I'm not afraid of slight microsecond imprecision that wouldn't matter; I'm afraid of returning a timestamp that is completely broken in some edge case, and the more arithmetic is going on, the more risk there is. Sure, we can double- and triple-check the math to convince ourselves, it's just that I don't think your solution brings any real benefits in practice "for conflict-resolution timestamps" over my proposition, and I think my solution is conceptually simpler, and I think we should always go for simpler when we can, and I think we can.

          Now, I've discussed my view on the ticket itself (which I still halfway think could be closed as won't fix, since at the end of the day the real problem for which it was opened is really CASSANDRA-6123), and on your branch (for which I don't see the point of "getting comfortable with the math" when there is a simpler solution imo), enough. I don't see much to add at this point. I'm not vetoing your solution, I just can't +1 it when I think my solution is a tad better (because simpler). Let's have someone else look at it and formulate an opinion; probably I'm just being difficult for lack of sleep.

          Benedict added a comment -

          Well, what I should have made clear is that I am willing to drop the monotonicity guarantees, however I am -1 on your extra thread.

          But I still think the monotonicity guarantees are good, and not so difficult to prove, so if we can get somebody who doesn't have a newborn to contend with to take a look maybe that wouldn't be a bad thing

          In case it helps, here's a quick proof we can never give a whack value:

          1.  -100000 <= adjustMicros <= 100000
          2.  expire - adjustFrom = 1000000000
          2a. expireMicros - adjustFromMicros = 1000000
          3.  adjustFromMicros <= micros <= expireMicros
          4.  delta = (adjustMicros * (micros - adjustFromMicros)) / (expireMicros - adjustFromMicros)
          5.  2a ^ 3 ^ 4 -> expireMicros - adjustFromMicros >= micros - adjustFromMicros -> |delta| <= |adjustMicros|

          i.e. the adjustment is always at most adjustMicros in magnitude, which is itself capped at 100ms per second (per 1 and 2). So we can never give a totally whack result. I can do more thorough proofs of other criteria, but I think this plus my other statement is enough to demonstrate its safety.

          Benedict added a comment -

          Also, just in case overflow might be considered an issue: per 1 and 2a, we have adjustMicros * (micros - adjustFromMicros) <= 100 billion, which is well within the limits of safe long values.

          Jonathan Ellis added a comment -

          Do you have time to look at the math, Christopher Smith? The code itself is a single class and reasonably straightforward.

          Benedict added a comment -

          Just to add to the analysis: since it's quite computationally tractable, I have uploaded a patch to the branch which brute-force checks every possible computation to ensure the result is always monotonically increasing and within the expected bounds. I have run this to completion, and indeed all of my statements above check out empirically.
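          A toy version of that kind of exhaustive check (not the actual test patch; it uses a coarse stride over the adjustment range purely to keep the runtime down, and the variable names simply mirror the proof sketch above):

          // Toy, self-contained check: walk adjustMicros in [-100000, 100000] (coarse 1000-step
          // stride) and elapsed micros in [0, 1000000], and verify that the computed adjustment
          // never exceeds |adjustMicros| and that the adjusted time never steps backwards.
          public final class AdjustmentBoundCheck
          {
              public static void main(String[] args)
              {
                  final long interval = 1_000_000L; // expireMicros - adjustFromMicros
                  for (long adjust = -100_000L; adjust <= 100_000L; adjust += 1_000L)
                  {
                      long prevAdjusted = Long.MIN_VALUE;
                      for (long elapsed = 0; elapsed <= interval; elapsed++)
                      {
                          long delta = (adjust * elapsed) / interval; // max product is 1e11, no long overflow
                          if (Math.abs(delta) > Math.abs(adjust))
                              throw new AssertionError("delta exceeds cap: " + delta);
                          long adjusted = elapsed + delta;            // elapsed micros plus correction
                          if (adjusted < prevAdjusted)
                              throw new AssertionError("time went backwards at elapsed=" + elapsed);
                          prevAdjusted = adjusted;
                      }
                  }
                  System.out.println("all sampled combinations OK");
              }
          }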

          Christopher Smith added a comment -

          Sorry, I've been crazy busy this week. I'll look this over today or tomorrow and provide feedback.

          Jonathan Ellis added a comment -

          Missed the window to get this into 2.0, and for 3.0 we're planning to switch to a unique-per-client "time id" instead (CASSANDRA-6108).

          Benedict added a comment -

          It seems that, since CASSANDRA-6108 has been shelved and the alternative follow-ups aren't slated for anytime soon (including TimeUUID client-side timestamps), we should consider introducing this in 3.0 to fix (or limit) the bug with resolution of conflicting data carrying the same timestamp. We should also ask the driver maintainers to do this on their side for v3.0+ protocol implementations.

          Benedict added a comment -

          Reopening for further discussion.

          Benedict added a comment -

          Sorry for ping-ponging the ticket state everyone. Been some lively discussion on IRC.

          Benedict added a comment -

          It was decided that v3 protocol implementors should ensure their timestamps are high precision, since this is the default source of timestamps going forwards.


            People

            • Assignee: Benedict
            • Reporter: Christopher Smith
            • Votes: 1
            • Watchers: 20
