CASSANDRA-7144: Fix handling of empty counter replication mutations

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.17, 2.0.9
    • Component/s: None
    • Labels: None
    • Environment: Ubuntu 12.04 w/ Oracle JVM, 5-node cluster. Nodes 2 GB / 2 cores on DigitalOcean.

      Description

      First time reporting a bug here, apologies if I'm not posting it in the right space.

      At what seem like random intervals, on random nodes and in random situations, I get the following exception. After this, hinted handoff starts timing out and the node stops participating in the cluster.

      I started seeing these after switching to the Cassandra Python-Driver from the Python-CQL driver.

      ERROR [WRITE-/10.128.180.108] 2014-05-03 13:45:12,843 CassandraDaemon.java (line 198) Exception in thread Thread[WRITE-/10.128.180.108,5,main]
      java.lang.AssertionError
      	at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:271)
      	at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:259)
      	at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:120)
      	at org.apache.cassandra.net.OutboundTcpConnection.writeInternal(OutboundTcpConnection.java:251)
      	at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:203)
      	at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:151)
      ERROR [WRITE-/10.128.194.70] 2014-05-03 13:45:12,843 CassandraDaemon.java (line 198) Exception in thread Thread[WRITE-/10.128.194.70,5,main]
      java.lang.AssertionError
      	at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:271)
      	at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:259)
      	at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:120)
      	at org.apache.cassandra.net.OutboundTcpConnection.writeInternal(OutboundTcpConnection.java:251)
      	at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:203)
      	at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:151)
      ERROR [MutationStage:118] 2014-05-03 13:45:15,048 CassandraDaemon.java (line 198) Exception in thread Thread[MutationStage:118,5,main]
      java.lang.AssertionError
      	at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:271)
      	at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:259)
      	at org.apache.cassandra.utils.FBUtilities.serialize(FBUtilities.java:654)
      	at org.apache.cassandra.db.HintedHandOffManager.hintFor(HintedHandOffManager.java:137)
      	at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:908)
      	at org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:881)
      	at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:1981)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:744)
      ERROR [MutationStage:117] 2014-05-03 13:45:15,048 CassandraDaemon.java (line 198) Exception in thread Thread[MutationStage:117,5,main]
      java.lang.AssertionError
      	at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:271)
      	at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:259)
      	at org.apache.cassandra.utils.FBUtilities.serialize(FBUtilities.java:654)
      	at org.apache.cassandra.db.HintedHandOffManager.hintFor(HintedHandOffManager.java:137)
      	at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:908)
      	at org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:881)
      	at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:1981)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:744)
      

      The service must be restarted for the node to come back online. Let me know any additional configuration details needed.

      1. 7144.txt
        2 kB
        Aleksey Yeschenko
      2. 7144-2.txt
        2 kB
        Aleksey Yeschenko

        Activity

        jbellis Jonathan Ellis added a comment -

        You didn't give a version so I'm not 100% sure but I think you're seeing this assert:

                    int size = rm.modifications.size();
                    out.writeInt(size);
                    assert size > 0;
        

        which means that your client is sending a zero-length mutation to the server.

        do we not validate that in the native protocol Tyler Hobbs?
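
        For illustration, a minimal standalone sketch (hypothetical class, not Cassandra code) of why a mutation with zero modifications blows up in that serializer; run with -ea so assertions are enabled:

        import java.util.HashMap;
        import java.util.Map;

        // Standalone sketch mirroring the size > 0 assert in RowMutationSerializer.serialize().
        public class EmptyMutationSketch
        {
            static void serialize(Map<String, byte[]> modifications)
            {
                int size = modifications.size();
                assert size > 0 : "zero-length mutation";
                // the real serializer would then write size and each modification to the output stream
            }

            public static void main(String[] args)
            {
                serialize(new HashMap<String, byte[]>()); // empty map -> AssertionError, matching the stack traces above
            }
        }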

        nesnub Maxime Lamothe-Brassard added a comment -

        I had mentioned it, probably in the wrong field, since it got removed: 2.0.7, the latest at the moment. Quick update: I ended up decommissioning the node, wiping /var/lib/cassandra/*, and then re-joining the cluster (same install, just wiped the dir). It has 100% solved my problem (so far anyway, about 2 days of uptime).

        So it sounds like the data on disk was somehow corrupted somewhere between my client code update and repairs.

        thobbs Tyler Hobbs added a comment -

        Maxime Lamothe-Brassard when you saw this, were you using prepared statements or unprepared statements? Were you using BATCH, INSERT, UPDATE, and/or DELETE?

        So far I've tried a few things to reproduce this with no luck.

        nesnub Maxime Lamothe-Brassard added a comment -

        I was not using prepared statements. I was doing INSERT and SELECT, no DELETE.

        I have some more info now. As I said, I rebuilt the data on the node and that eliminated the problem. However, a day later I killed my ingestor (a Python script using the new Cassandra python-driver that does lots of INSERTs), and when I did I got the exception above. Thinking it was a "runtime" bug, I just kept going. Then hinted handoff started timing out on that box, so I restarted Cassandra. From that point on, I would get the same exception at random intervals, without ever killing the ingestor. It seems as if killing the script during a query sent data to Cassandra that corrupted something on disk, and from then on, whenever it reached that part of the data on disk (I use that liberally; I just mean NOT directly from the script doing the ingestion), it would throw the exact same exception, leading to the timeouts again. Restarting the Cassandra node did nothing; I had to rebuild the data again. So now I'm very paranoid about killing my ingestion.

        The ingestion uses UNLOGGED BATCH for some of the data for performance, as well as normal INSERTs.

        jason.punyon Jason Punyon added a comment -

        I'm Jason and I work at Stack Overflow. We're being bitten by this right now on our 2.0.6 cluster and we haven't wiped it out yet.

        We got to this state on Friday after a power outage took out our entire datacenter (https://twitter.com/JasonPunyon/status/467290879569719296). We have a 3-node cluster and we interact with it using the CqlSharp library. We do our reads and writes at QUORUM, some of them are batches. The first symptom of the problem was that both read and write operations were timing out. When we dug in, we found that nodes 1 and 3 were throwing errors (not enough responses), but node 2 would execute the exact same operations fine at quorum. Nodes 1 and 3 would execute reads fine at ONE.

        We're pretty interested in getting it fixed. Is there any way we can help?

        thobbs Tyler Hobbs added a comment -

        Hi Jason Punyon, thanks for your offer to help. The best way to help at the moment would be to come up with a reproducible test case, if you can. I have one other ticket in my queue in front of this one, so I should be able to give it proper attention pretty soon.

        jason.punyon Jason Punyon added a comment -

        Tyler Hobbs We racked our brains a bit, but we can't really conceive of a direction to go in to replicate it again (remember it's still live on our cluster; we can run any diagnostics you want), aside from rebuilding and then randomly cutting power (which is what we did the first time).

        thobbs Tyler Hobbs added a comment -

        Jason Punyon Maxime Lamothe-Brassard can you provide any other details about the writes? I presume you were using prepared statements? How large were the batches? Roughly what do the individual inserts (within the batches or otherwise) look like?

        nesnub Maxime Lamothe-Brassard added a comment -

        The writes were not using prepared statements. The batches contain ordinary INSERT/UPDATEs, but I also have a counter update interleaved between the statements. Here is a sample:

        self.queries.append( ( 'UPDATE met SET n = n + 1 WHERE id = %(id)s', { 'id' : k } ) )
        self.queries.append( ( '''BEGIN UNLOGGED BATCH
                                  INSERT INTO om ( id, obj, otype ) VALUES ( %(id)s, %(obj)s, %(type)s );
                                  UPDATE loc SET last = %(ts)s WHERE sid = %(sid)s AND otype = %(type)s AND id = %(id)s;
                                  INSERT INTO lid ( id, sid ) VALUES ( %(id)s, %(sid)s );
                                  APPLY BATCH;''',
                                { 'id' : k,
                                  'obj' : unicode( obj[ 0 ] ).lower(),
                                  'type' : obj[ 1 ],
                                  'ts' : ts,
                                  'sid' : ag } ) )
        
        kohlisankalp sankalp kohli added a comment -

        Another point here: one of these assertion errors, thrown at OutboundTcpConnection.java:151, will cause that thread to die.

        This is very bad, because the thread will not be recreated, so the node won't be able to talk to the peer that connection served.
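
        For context, a standalone sketch (hypothetical thread name, not Cassandra code) of that failure mode: an uncaught AssertionError escaping run() terminates the dedicated writer thread, and unless something recreates it, messages queued for that peer are never delivered.

        public class WriterThreadDeathSketch
        {
            public static void main(String[] args) throws InterruptedException
            {
                Thread writer = new Thread(new Runnable()
                {
                    public void run()
                    {
                        // imagine dequeueing and serializing mutations in a loop here;
                        // serializing an empty mutation throws AssertionError and escapes run()
                        throw new AssertionError();
                    }
                }, "WRITE-/10.0.0.1");

                writer.start();
                writer.join();
                System.out.println("writer alive: " + writer.isAlive()); // false: that connection's writer is gone for good
            }
        }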

        kohlisankalp sankalp kohli added a comment -

        Maxime Lamothe-Brassard Jason Punyon
        Are you using CAS-based inserts?

        rlow Richard Low added a comment -

        We saw this in 3 separate clusters at roughly the same time. In 2 clusters it happened on one node only. On 1 cluster it happened on 3 nodes.

        Every time it happened, there were 3 errors in WRITE threads with stack trace:

        java.lang.AssertionError
        	at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:271)
        	at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:259)
        	at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:120)
        	at org.apache.cassandra.net.OutboundTcpConnection.writeInternal(OutboundTcpConnection.java:251)
        	at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:203)
        	at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:151)
        

        and the IPs in the thread names are all adjacent in the ring (we have replication 3+3 in 2 DCs). 2-3 seconds after that, there are stack traces in the MutationStage like:

        java.lang.AssertionError
        	at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:271)
        	at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:259)
        	at org.apache.cassandra.utils.FBUtilities.serialize(FBUtilities.java:654)
        	at org.apache.cassandra.db.HintedHandOffManager.hintFor(HintedHandOffManager.java:137)
        	at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:897)
        	at org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:870)
        	at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:1961)
        

        The number of these varies from 5 to 65.

        rlow Richard Low added a comment -

        In fact, the MutationStage AssertionErrors always come in groups of 5 (within 1 ms of each other).

        rlow Richard Low added a comment -

        I have a reproduction. I'm not sure if it was the cause of our issue, but it reproduces an identical situation.

        I have a 3-node cluster running on my box. I ran these CQL commands:

        cqlsh> create keyspace ks with replication = {'class':'SimpleStrategy', 'replication_factor':2};
        cqlsh> use ks;
        cqlsh:ks> create table counter_test (a text primary key, b counter) with gc_grace_seconds = 0;
        cqlsh:ks> consistency all;
        Consistency level set to ALL.
        cqlsh:ks> update counter_test set b = b + 1 where a = 'a';
        cqlsh:ks> delete from counter_test where a = 'a';
        cqlsh:ks> update counter_test set b = b + 1 where a = 'a';
        Request did not complete within rpc_timeout.
        

        And in the logs:

        ERROR [WRITE-/127.0.0.2] 2014-06-03 14:37:24,186 CassandraDaemon.java (line 196) Exception in thread Thread[WRITE-/127.0.0.2,5,main]
        java.lang.AssertionError
                at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:271)
                at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:259)
                at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:120)
                at org.apache.cassandra.net.OutboundTcpConnection.writeInternal(OutboundTcpConnection.java:251)
                at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:203)
                at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:151)
        ERROR [MutationStage:156] 2014-06-03 14:37:26,301 CassandraDaemon.java (line 196) Exception in thread Thread[MutationStage:156,5,main]
        java.lang.AssertionError
                at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:271)
                at org.apache.cassandra.db.RowMutation$RowMutationSerializer.serialize(RowMutation.java:259)
                at org.apache.cassandra.utils.FBUtilities.serialize(FBUtilities.java:654)
                at org.apache.cassandra.db.HintedHandOffManager.hintFor(HintedHandOffManager.java:137)
                at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:897)
                at org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:870)
                at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:1961)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
                at java.util.concurrent.FutureTask.run(FutureTask.java:262)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                at java.lang.Thread.run(Thread.java:745)
        

        The problem is that the row tombstone has microsecond timestamps but the counter has milliseconds (I just created CASSANDRA-7346 for this). When the local counter mutation is applied, it is immediately removed because the row tombstone has a later timestamp. When we read it back, once the row tombstone becomes GCable (that's why I set gc_grace to 0), we get null for the row, so we don't add a mutation to the RowMutation object (in CounterMutation.makeReplicationMutation()). When this mutation gets sent, it kills the connection thread, the write times out so we write a hint, and the hint causes the second assertion failure. We now have to wait until the row tombstone is GCed to be able to increment anything in this row again.

        Even without CASSANDRA-7346, it is legitimate to insert a row tombstone with a later timestamp, so this could still happen. We need to detect this case and handle it so the empty counter increment doesn't cause trouble. We can just say it's done, because the counter is deleted, and not bother to replicate.

        I'll keep looking in case there are other code paths that trigger this.
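
        A hedged standalone sketch of that handling (invented names; the actual change is in the attached 7144.txt / 7144-2.txt patches): skip replication when the read-back counter mutation is empty, and still signal the response handler so the write completes instead of timing out.

        import java.util.Arrays;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        // Standalone sketch, not the committed patch: an empty counter replication
        // mutation is never sent; the response handler is acked immediately instead.
        public class EmptyCounterReplicationSketch
        {
            interface ResponseHandler { void response(Object reply); }

            static void replicate(Map<String, byte[]> replicationMutation,
                                  List<String> replicaEndpoints,
                                  ResponseHandler handler)
            {
                if (replicationMutation.isEmpty())
                {
                    // Nothing to replicate: the local counter cell was shadowed by a row tombstone.
                    // "Poke the responseHandler" anyway so the coordinator does not wait forever.
                    handler.response(null);
                    return;
                }
                for (String endpoint : replicaEndpoints)
                    send(replicationMutation, endpoint); // an empty mutation here would trip the size > 0 assert
            }

            static void send(Map<String, byte[]> mutation, String endpoint)
            {
                System.out.println("replicating " + mutation.size() + " modifications to " + endpoint);
            }

            public static void main(String[] args)
            {
                ResponseHandler handler = new ResponseHandler()
                {
                    public void response(Object reply) { System.out.println("acked"); }
                };
                // Empty mutation: acked immediately, nothing sent to the replicas.
                replicate(new HashMap<String, byte[]>(), Arrays.asList("10.0.0.2", "10.0.0.3"), handler);
            }
        }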

        iamaleksey Aleksey Yeschenko added a comment -

        Attaching a simple 1.2-based patch. Only 1.2 and 2.0 are affected; the 2.1 implementation is not.

        jason.punyon Jason Punyon added a comment -

        Having looked at our writes, they were almost exclusively batches (sized 100-1000) and a counter batch (max size 1000). We used CAS for exactly one table (our migrations table), which was used very infrequently (in our build, a few times a day).

        rlow Richard Low added a comment -

        Jason Punyon did you ever issue any row deletes to your counter table?

        Aleksey Yeschenko I don't see any patch

        jason.punyon Jason Punyon added a comment -

        Richard Low Yes we did.

        rlow Richard Low added a comment -

        Jason Punyon great, this is almost certainly the cause for you. You should check out CASSANDRA-7346, since row deletes on counter tables don't do what you expect at all.

        rlow Richard Low added a comment -

        Aleksey Yeschenko I tried your patch (rebased for 2.0.6 where we're seeing the problem). It stops the AssertionError which is great, but I still get a timeout for the response. Do we need to poke the responseHandler too?

        iamaleksey Aleksey Yeschenko added a comment -

        Do we need to poke the responseHandler too?

        We do. Oh this is ugly-ish.

        rlow Richard Low added a comment -

        Confirmed v2 works on 2.0.6. There is no AssertionError or timeout. I haven't verified that it hasn't broken anything else, but it looks like it shouldn't have.

        iamaleksey Aleksey Yeschenko added a comment -

        Committed, thanks. (Fixed the non-reuse of the replicationMutation local var on commit.)

        rlow Richard Low added a comment -

        Confirmed that we issued counter deletes so it's the same root cause. Thanks for the fix!


          People

          • Assignee:
            iamaleksey Aleksey Yeschenko
            Reporter:
            nesnub Maxime Lamothe-Brassard
            Reviewer:
            Richard Low
          • Votes:
            5
            Watchers:
            12
