Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: Core
    • Labels: None
    • Environment:
      Amazon Linux

Description

      Seeing errors like these:

      2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard

      What does it mean?
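      For context on the triple in that message: a counter shard is a (counter id, clock, count) tuple, and two shards that share an id and a clock are supposed to carry the same count. A minimal sketch of the merge rule the message describes (illustrative only, not the actual CounterContext implementation):

```python
# Illustrative model of the check behind the error message, NOT the real
# org.apache.cassandra.db.context.CounterContext code: a shard is a
# (counter_id, clock, count) triple, and two shards with equal id and
# clock but different counts are invalid; the merge keeps the higher count.
from collections import namedtuple

Shard = namedtuple("Shard", ["counter_id", "clock", "count"])

def merge_shards(a: Shard, b: Shard) -> Shard:
    if a.counter_id != b.counter_id:
        raise ValueError("shards for different counter ids cannot be merged")
    if a.clock != b.clock:
        # Normal case: the shard with the higher clock wins outright.
        return a if a.clock > b.clock else b
    if a.count != b.count:
        # The situation this ticket is about: same id, same clock,
        # different count -- "invalid counter shard detected".
        print("invalid counter shard detected; will pick highest to self-heal")
        return a if a.count > b.count else b
    return a  # identical shards

# The two shards from the log line above:
left = Shard("17bfd850-ac52-11e1-0000-6ecd0b5b61e7", 1, 13)
right = Shard("17bfd850-ac52-11e1-0000-6ecd0b5b61e7", 1, 1)
healed = merge_shards(left, right)
```

      Here the higher count (13) wins, which is what "will pick highest to self-heal" refers to; as discussed in the comments, the healed value is only persisted when the merge happens during compaction.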

      1. err.txt
        450 kB
        Ivan Sobolev
      2. cassandra-mck.log.bz2
        2.90 MB
        mck

        Activity

        Transition: Open → Resolved | Time in source status: 495d 22h 37m | Execution times: 1 | Last executer: Jonathan Ellis | Last execution date: 14/Nov/13 05:38
        Jonathan Ellis made changes -
        Status: Open [ 1 ] → Resolved [ 5 ]
        Resolution: Won't Fix [ 2 ]
        Jonathan Ellis added a comment -

        The only way we know how to fix this is as part of the 2.0 rewrite (CASSANDRA-4775).

        Phil Pirozhkov added a comment -

        Cassandra 1.2.5, single node dev local installation.
        Schema:

        CREATE TABLE reporting (
            zoom int,
            time timestamp,
            total counter,
            PRIMARY KEY (zoom, time)
        )
            WITH CLUSTERING ORDER BY (time ASC);
        
        update reporting set total = total + 1 where zoom = 0 and time = 1234142142141;
        update reporting set total = total + 1 where zoom = 1 and time = 1234142142141;
        update reporting set total = total + 1 where zoom = 2 and time = 1234142142141;
        

        Query:

        select * from reporting where zoom=0;

        may produce different results: either an rpc timeout, or 'total' is null.
        Nodetool repair does nothing and hangs from time to time.
        Chance to reproduce: about 50%.
        Tried changing to batch commitlog mode; same result (but 10 times less performant).

        Gavin made changes -
        Workflow: patch-available, re-open possible [ 12753292 ] → reopen-resolved, no closed status, patch-avail, testing [ 12756006 ]
        Gavin made changes -
        Workflow: no-reopen-closed, patch-avail [ 12710415 ] → patch-available, re-open possible [ 12753292 ]
        Janne Jalkanen added a comment -

        Turns out that no amount of repair (I ran both repair -pr and full repair) allows the counter values to converge. One node had consistently wrong counts that would not be repaired no matter what. In the end I took out the node, removed all data and brought it back into the cluster and let it reinitialize itself. Now the values are converged.

        Janne Jalkanen added a comment - edited

        I'm seeing this while running repair -pr. Three-node cluster, RF 3. Straight upgrade from 1.0.12 to 1.1.8; no topology changes. I see two invalid shard IDs; the counts differ by more than one, sometimes even by 3000 or more. Seems random to my eyes.

        Our counters are in a composite column family, no TTLs in use. We mostly increment by one, but sometimes more.

        I did disablegossip, disablethrift, drain, shutdown, upgrade, restart on every node in a rolling fashion. Then I did upgradesstables and repair -pr on every node when the entire cluster had been upgraded. Environment is Ubuntu Linux 12.04 LTS, JVM is OpenJDK 7u9.
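        The rolling procedure above, written out as a per-node command sequence (a sketch only: the commands are just assembled as strings, nothing is executed, and the "service cassandra" stop/start lines are an assumption about the init setup):

```python
# Sketch of the rolling-upgrade procedure described above. This only
# builds the command strings; nothing is executed. The
# "ssh ... service cassandra stop/start" lines are an assumed
# init-system interface; the package upgrade itself happens while
# the node is down.

def rolling_upgrade_steps(node: str) -> list[str]:
    """Per-node steps, performed one node at a time (rolling)."""
    return [
        f"nodetool -h {node} disablegossip",
        f"nodetool -h {node} disablethrift",
        f"nodetool -h {node} drain",
        f"ssh {node} sudo service cassandra stop",   # upgrade the node here
        f"ssh {node} sudo service cassandra start",
    ]

def post_upgrade_steps(node: str) -> list[str]:
    """Run on every node only after the whole cluster is upgraded."""
    return [
        f"nodetool -h {node} upgradesstables",
        f"nodetool -h {node} repair -pr",
    ]

for node in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):  # hypothetical node IPs
    for cmd in rolling_upgrade_steps(node):
        print(cmd)
```

        The ordering matters: drain flushes memtables and stops accepting writes before shutdown, and upgradesstables/repair wait until no mixed-version nodes remain.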

        Last repair picked 497 invalid counter shards, and we have approximately 8 million counters, of which about a hundred is incremented each second (and sometimes subtracted from if our read repair kicks in - we have our own in-app repair for certain low values). All the counter writes are batched with 100 increments/batch. So this is only affecting a really small subset, though it's rather annoying when it happens, as it means that you can never really trust the counters to be even in the ballpark :-/
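        For a sense of scale, the numbers above put the damage at a tiny fraction of the keyspace, which is the "really small subset" point:

```python
# Rough arithmetic from the figures above: 497 invalid counter shards
# reported by the last repair, out of roughly 8 million counters.
invalid_shards = 497
total_counters = 8_000_000

fraction = invalid_shards / total_counters
print(f"affected: {fraction:.4%} of all counters")  # roughly 0.006%
```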

        Ed Solovey added a comment -

        We are on 1.1.6 and are seeing this on a three node cluster with replication factor of 2. Is there a workaround for this? Corrupted counters are a showstopper for us and we'll have to move off Cassandra if we can't resolve this.

        Michael Kjellman added a comment -

        Unfortunately, hitting this as well. We increment by different values as well. RF=3 on 1.1.6. It happened to me after I did a nodetool drain and restarted a node. When it came back up, I started seeing it being logged.

        mck made changes -
        Attachment cassandra-mck.log.bz2 [ 12552451 ]
        mck added a comment - edited

        Sylvain, here's a log from one node. For most of the log we were running 1.0.8. Then at line 2883399 we upgraded (this was the first node to upgrade) to 1.1.6.

        The error message comes every few seconds.
        Our counters are sub-columns inside supercolumns.
        We completed the upgrade on all nodes, then restarted again (because jna was missing).

        We are now running upgradesstables, but that's not in this logfile. The error messages still appear.

        An operational problem we've had recently is that we had one node down for ~one month (faulty raid controller) and when we finally brought the node back into the cluster nightly repairs would never finish. In the end we just disabled nightly repairs (we don't have tombstones) with the plan that an upgrade and upgradesstables would bring us back to a state where repairs would work again. I have no idea if this can be related.

        Ivan Sobolev added a comment -

        Sylvain Lebresne
        Quick question: do you always increment by the same value by any chance?

        The attached log has not only +1 increments (though I don't think any log would help you there).

        We run 1.1.5, no upgradesstables, and most probably an unclean shutdown too.

        Ivan Sobolev made changes -
        Field Original Value New Value
        Attachment err.txt [ 12551346 ]
        Ivan Sobolev added a comment -

        We have examples where increment is not always 1.

        Jonathan Ellis added a comment -

        "On a bootstrap" sounds more like CASSANDRA-4071.

        Eric Lubow added a comment - edited

        We are getting this on DSE 2.2 (C* 1.1.5) on a new node during bootstrap. We upgraded the cluster from C* 1.0.10 about 10 days ago; upgradesstables was run on every node and we repaired the entire cluster. We've been getting this error sporadically on various nodes at various points, but it's not consistent. I've double- and triple-checked every node looking for sstable files named "-hd-" and I don't see any (assuming that's enough to tell that the sstable has been upgraded). If this error is an effect of requiring one to run upgradesstables, then how would it happen during a bootstrap? All nodes involved in this cluster are 1.1.5.

        Chris Herron added a comment -

        I'm really starting to think that CASSANDRA-4071 is likely the main cause for this, and it is very easy to reproduce in that case. The commit log issue we've discussed earlier can also trigger that error, but it's probably much harder to trigger.

        In our case:

        • We haven't made any topology changes
        • Our test drops and recreates the affected CFs. No nodes die during the test (w.r.t. unclean shutdown and commit log)
        • After previous load test runs under different configuration (see below), no nodes die, and we use nodetool drain before restarting with updated configs.

        Note that in my earlier comment above I said:

        In investigating CASSANDRA-4687 we disabled key cache, repeated the load+upgradesstables test and these invalid counter shard warnings did not appear.

        Given that we don't have a topology change, can you think of a scenario where a commitlog issue is still contributing?

        Sylvain Lebresne added a comment -

        No, sorry just happened to pick that example.

        That's ok, thanks nonetheless.

        I'm really starting to think that CASSANDRA-4071 is likely the main cause for this, and it is very easy to reproduce in that case. The commit log issue we've discussed earlier can also trigger that error, but it's probably much harder to trigger. The good news is that, as explained on that ticket, CASSANDRA-4071 won't corrupt counters unless you are at RF=1 (in which case that's not good news). The bad news is I'm not really sure how to fix it.

        Chris Herron added a comment -

        Quick question: do you always increment by the same value by any chance?

        No, sorry just happened to pick that example. We have many other log entries where both values are higher and don't differ by 1.

        Sylvain Lebresne added a comment -

        Could this be related to: CASSANDRA-4071?

        I missed that earlier on, but yes, Bartłomiej is correct, CASSANDRA-4071 will totally trigger 'invalid counter shard detected' messages. As described in the ticket, if you don't use RF=1, this shouldn't actually create data loss, but it would still trigger the log until things get compacted away.

        Sylvain Lebresne added a comment -

        Quick question: do you always increment by the same value, by any chance? I'm asking because the last log you've pasted indicates that the conflicting information found corresponds to 1 increment only: in the first case the value is 1, in the second the value is 2. If you always increment by, say, 1, that would tell us which one is wrong (I'm not yet sure which conclusion I would draw from that, but more info can't hurt).

        Chris Herron added a comment -

        Another observation since then: in previous runs with key cache disabled we were not seeing any errors. However, I've since found some invalid counter shard errors that occur during normal compaction.

        ERROR [CompactionExecutor:6] 2012-10-19 15:43:50,920 org.apache.cassandra.db.context.CounterContext invalid counter shard detected; (15b843e0-ff7c-11e0-0000-07f4b18563ff, 1, 1) and (15b843e0-ff7c-11e0-0000-07f4b18563ff, 1, 2) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard

        So to be clear, this particular scenario is:

        • C* 1.1.6 with key cache disabled.
        • Load test ran earlier against this same setup; but no upgradesstables during that run; no errors under load during that test run.
        • Later, some nightly jobs ran that read from Super CF counters, write to other CFs.
        • Compaction activity occurs later, after the load test and nightly jobs complete. Invalid counter shard errors are seen for some CFs. Gleaning from the log output order, the affected CFs:
          • Did have upgradesstables run on them in previous configurations (1.1.6, key cache on)
          • Have not been written to at all for the purpose of the load test I've been mentioning.
          • Have been read from for these nightly jobs.
        Sylvain Lebresne added a comment -

        OK. The fact that you only reproduce when using upgradesstables is definitely interesting. I'll check if I can see something in upgradesstables causing that. I'll keep you posted.

        Chris Herron added a comment -

        Also, during that test, is there anything involving streaming going on (a repair, a node bootstrapping/moving/decommissioning)?

        There are definitely no repairs or node bootstrapping/moving/decommissioning happening during the test.
        Re-ran the test and the JMX stats for StreamStage indicated zero tasks on all nodes after the test completed.

        Chris Herron added a comment -

        Probably, how was it deleted/recreated. Did you drop and recreate?

        Yes, dropped (the schema migration flavor) and recreated a CF of the same name.

        Perform the same test without the upgradesstables part (i.e. only the writes and reads). If so, does that change something?

        I have already tested that scenario. Running this load test without the concurrent upgradesstables compaction activity, the problem does not occur.

        during that test, is there anything involving streaming going on (a repair, a node bootstrapping/moving/decommissioning)?

        Not that I know of. I can test again and monitor for streaming activity to see.

        By the way, as we've been testing in preparation for a 1.1.x upgrade, we were seeing symptoms of CASSANDRA-4571, CASSANDRA-4687 as well as this issue on C* 1.1.6. In investigating CASSANDRA-4687 we disabled key cache, repeated the load+upgradesstables test and these invalid counter shard warnings did not appear.

        Sylvain Lebresne added a comment -

        Is that equivalent?

        Probably, how was it deleted/recreated. Did you drop and recreate?

        This does reproduce for every test run.

        Interesting, you may be onto something. Does it make sense for you to perform the same test without the upgradesstables part (i.e. only the writes and reads)? If so, does that change something? Also, during that test, is there anything involving streaming going on (a repair, a node bootstrapping/moving/decommissioning)? Trying to narrow down what's involved in your test.

        Chris Herron added a comment -

        Unless you've been able to reproduce on a brand new cluster where the commit log was set to batch from the beginning (in which case, if you have an easy way to reproduce, that would be interesting to know)

        In our test the affected Super CF is completely deleted and recreated - so in that sense the commit log was set to batch from the beginning. Is that equivalent?

        This does reproduce for every test run. Unfortunately our test is non-trivial to share. It involves heavy writes and moderate reads to counters while simultaneously running upgradesstables on all nodes for multiple CFs (including the affected one). Interestingly, the symptom appears even before compaction reaches the Super CF that's active during the test.

        Sylvain Lebresne added a comment -

        After reading Sylvain's comments above, tried running the same test with commitlog_sync: batch - we get a similar volume of the same errors

        Just to clarify, using batch commit log should only prevent the initial problem from reproducing (assuming the analysis of the problem is correct, of course). However, contrary to what the error message suggests, the existing "invalid counter shards" don't "heal" themselves as soon as the message is logged. In fact, the message is logged each time we merge counter columns that have conflicting shards; when that merge is triggered by a compaction, it will indeed "heal" the shard, but we also merge each time we read, for instance. In other words, even if batch commit log fixes the problem, one will need to compact everything/wait for everything to be compacted for all the logged messages to disappear. Unless you've been able to reproduce on a brand new cluster where the commit log was set to batch from the beginning (in which case, if you have an easy way to reproduce, that would be interesting to know).

        Chris Herron added a comment -

        We are seeing large volumes of this error on all nodes when running a load test while also running upgradesstables on multiple CF's on each node.

        After reading Sylvain's comments above, tried running the same test with commitlog_sync: batch - we get a similar volume of the same errors.

        (Running a build from branch cassandra-1.1 at commit 4d2e5e73b127dc0b335176ddc1dec1f0244e7f6d, with Java 6u35 on Amazon Linux 2.6.35)

        Bartłomiej Romański added a comment -

        And one last comment. Could this be related to CASSANDRA-4071? If I understand the description correctly, any topology change (adding a node, moving a node) while the counter is spread across more than one sstable can result in the "invalid counter shard detected" error message during reads. Am I right?

        Bartłomiej Romański added a comment -

        In the previous comment I wanted to point directly to CASSANDRA-4436 - I've mixed up numbers.

        One more thing: could hinted handoff possibly be related to this issue somehow? We've got a problem with it (CASSANDRA-4673) which was discovered at (more or less) the same time as our counter problems. Is there a possibility that sending a hinted handoff a few times ends up incrementing counters a few times?

        Bartłomiej Romański added a comment -

        Is it possible to predict how dangerous this bug could be? We are already experiencing very serious problems with CASSANDRA-4639. Our counter values suddenly became a few times higher than expected. As you can imagine, this is a disaster from the business point of view. We are already seriously thinking about going back to SQL databases :/ I wonder how (and if) this bug (and possibly other counter-related bugs) can affect us. We rely heavily on counters.

        Can this bug possibly lead to incorrect counter values, temporarily or permanently? Will running repair fix it?

        How incorrect could the counter values be? Losing a couple of increments immediately preceding a node failure is probably acceptable in most cases. Is it possible to lose more increments? Or end up with completely incorrect counter values as in CASSANDRA-4639?

        What would exactly happen after hitting this bug. Running repair should fix it? The "self-healing" mechanism would actually make counter consistent again? Or we will get this error messages over and over?

        Sorry for writing a comment full of questions, but I've got very limited knowledge of cassandra internals. I'll be very thankful if someone could refer to the questions above.

        Peter Schuller added a comment -

        @Sylvain I know it wouldn't be correlated with the same node; I was referring to uncontrolled shutdowns in general in the cluster.

        @Omid: Presumably the premise was that the mutation goes through the commit log on the leader prior to replication. I'm not sure if this is the case, but if it is, then it should work.

        @jbellis FWIW, our counter use-cases are such that going commit log sync is probably not feasible due to very high write throughput. That doesn't mean other people's use-cases are the same, and of course I fully support the idea of being correct by default (as opposed to performant by default).

        @Sylvain again: I agree about refreshing nodeid on every unclean restart being potentially dangerous. Counters are already huge due to the size of counter shards, and refreshing nodeids in any situation which might result in en-masse refreshment can definitely be dangerous both from a CPU usage perspective as well as a disk space one.

        Omid Aladini added a comment (edited) -

        A simple "workaround" is to use batch commit log, but that has a potentially important performance impact.

        I'm a bit confused about why batch commit would solve the problem. If Cassandra crashes before the batch is fsynced, the counter mutations in the batch which it was the leader for will still be lost, although they might have been applied on other replicas. The difference would be that the mutations won't be acknowledged to the client, and since counters aren't idempotent, the client won't know whether to retry or not. Am I missing something?
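        For reference, the batch commit log mode under discussion is controlled in cassandra.yaml. A minimal sketch of the relevant settings (option names as of the 1.x line; check the defaults shipped with your version):

```yaml
# cassandra.yaml -- commit log durability settings (1.x era names)

# 'batch' blocks each write until the commit log segment is fsynced,
# so an acknowledged mutation cannot be lost in a crash:
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 50

# The default, by contrast, acknowledges before fsync and only
# syncs periodically, leaving a window of potential loss:
# commitlog_sync: periodic
# commitlog_sync_period_in_ms: 10000
```

        As Omid points out above, batch mode narrows the failure window for the leader but does not make non-idempotent counter increments safe to retry from the client side.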

        Sylvain Lebresne added a comment -

        That's an option. Though not an exactly short term one (I suspect mixing periodic and batch cf on the same commit log might require a bit of care; unless you were thinking of having multiple commit logs, but I'm not sure that would be a good thing).

        But hey, I don't have a much better solution so far, so looking at that option is definitely worth it (since it's generally useful anyway).

        Jonathan Ellis added a comment -

        You would do more fsyncs, but only the CFs in actual batch mode would have to block for them. Periodic mode just queues the CL op and moves on to memtable append immediately.

        Brandon Williams added a comment -

        Under multiple-CF concurrency, wouldn't you effectively end up with batch mode?

        Jonathan Ellis added a comment -

        Maybe it's time to make commitlog mode (off/periodic/batch) per-CF instead of mix of global and per-KS. Then we could automatically force counter CF to batch.

        Sylvain Lebresne added a comment -

        That's a very good point. Counters do rely on the fact that nodes do not lose the increments they are "leader" for (or that they don't reuse the same nodeId if they do), but unless the commit log uses batch mode, this can happen. And that will lead to exactly the exception seen here, so I'd say there's a very good chance this is the problem.

        I'll note that if that is indeed the problem, it's very possible that the error is logged only much later (after the "unclean" shutdown) and on some other node than the one that died. So not being able to correlate the error to an unclean shutdown doesn't really indicate that it's not related.

        The consequence of this happening is that the increments recorded only in the un-synced commit log are lost. Meaning that with the default configuration, one could lose up to 10 seconds of increments (for which the dying node is the leader). However, I think it is also possible for reads to miss slightly more than that, though that last part should fix itself if the counter is incremented again.

        As for the error message logged, it's possible that lots of them are logged even though only a small number of counters are affected, since it's printed during column reconciliation and thus could be logged many times for the same counter.

        A simple "workaround" is to use batch commit log, but that has a potentially important performance impact.

        Another solution I've thought of would be to try to detect an unclean shutdown (by marking something during clean shutdown and checking for that at startup) and, if we detect one, to renew the nodeId. The problem with that is that this potentially means renewing the nodeId pretty often. And each time we do that, the internal representation of the counter grows, and I'm really afraid that will be a problem in that case. And while we have a mechanism to shrink counters back by merging sub-counts when the nodeId is renewed too often, that mechanism assumes that the node owning the nodeId has the most up-to-date value for its sub-count, which is exactly the problem here. So overall I don't have any good idea to fix this. Other ideas?
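        The reconciliation rule Sylvain describes can be sketched as a toy model (this is illustrative only, not Cassandra's actual CounterContext code): a shard is a (nodeId, clock, count) triple; for a given nodeId the higher clock wins, and two shards with equal nodeId and clock but different counts are exactly the "invalid counter shard" case, where the CASSANDRA-3641 self-heal picks the higher count.

```python
# Toy model of counter shard reconciliation (NOT Cassandra's actual
# CounterContext implementation). A shard is (node_id, clock, count).

def merge_shards(a, b):
    """Reconcile two shards that carry the same node_id."""
    node_a, clock_a, count_a = a
    node_b, clock_b, count_b = b
    assert node_a == node_b, "only shards from the same leader are merged"
    if clock_a != clock_b:
        # Normal case: the shard with the newer clock supersedes the other.
        return a if clock_a > clock_b else b
    if count_a != count_b:
        # The anomaly from the log: same node_id and clock, different counts.
        # Post-CASSANDRA-3641 behaviour: pick the highest count to self-heal.
        print("invalid counter shard detected; will pick highest to self-heal")
        return a if count_a > count_b else b
    return a  # identical shards, nothing to do

# The shards from the original report: counts 13 vs 1 at clock 1.
winner = merge_shards(("17bfd850", 1, 13), ("17bfd850", 1, 1))
print(winner)  # ('17bfd850', 1, 13)
```

        The "pick highest" heuristic makes replicas converge, but as noted in the comments below, if the root cause handed out the same clock twice, the surviving count is not guaranteed to be the correct total.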

        Peter Schuller added a comment -

        @Fabien That sounds plausible to me upon first read at least. I cannot confirm or deny whether it's possible that I've only seen this under circumstances where a non-clean shutdown has taken place. In particular, I cannot (at the top of my head anyway) think of a way in which the situation you describe is ever prevented from happening. So regardless of whether or not this is the explanation for this problem, it seems to me to at least be an explanation for it.

        Charles Brophy added a comment -

        Yes, we operate under a heavy write load and we do have frequent compaction as a result, but under normal conditions I never see this exception - or at least it doesn't happen often enough for me to catch. Following a repair, however, it's a guarantee for us. Could it be as simple as:

        • the two servers are participants in the same key-range replicant and the sstables contain the same key/row/column references
        • the process of streaming repair is sending a set of key/row/column references to the requestor in the same sstable as the out-of-sync data it's already aware of via repair
        • Compaction finds the duplicate references in the recently received sstables - they're basically the other replicant's copies of that data

        It seems that the act of sending the sstables from one server to the other when both are replicants of the same key range would be expected to result in duplicate references. I'm probably way off.

        Fabien Rousseau added a comment -

        Could it be that a counter increment is written to the commit log and sent to other replicas, BUT the commit log is not yet flushed to disk AND Cassandra is stopped/killed?

        Example:

        • node A receives an increment of 1 for c1
        • it stores (A, 1, 1)
        • it sends the increment to the replicas (A, 1, 1)
        • node A is killed (without the commit log being flushed to disk)
        • on restart, node A receives an increment of 3 for c1
        • it stores (A, 1, 3) (because it has no way of knowing the clock "1" was already used)
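        Fabien's scenario can be simulated with a few lines (a hypothetical model, not Cassandra code): a leader that loses its un-fsynced commit log forgets the clock it already handed out and reuses it with a different count, producing exactly the conflicting shards from the error message.

```python
# Toy simulation of clock reuse after a crash with an un-fsynced
# commit log (hypothetical model, not Cassandra code).

class Leader:
    def __init__(self, node_id):
        self.node_id = node_id
        self.clock = 0          # highest clock handed out, held in memory
        self.durable_clock = 0  # what actually survived on disk

    def increment(self, delta):
        self.clock += 1
        # This shard is what gets replicated to the other nodes.
        return (self.node_id, self.clock, delta)

    def crash_without_fsync(self):
        # The commit log was never flushed: the last clock grant is lost.
        self.clock = self.durable_clock

a = Leader("A")
shard1 = a.increment(1)   # ("A", 1, 1) -- replicated to peers
a.crash_without_fsync()   # node killed before the commit log hit disk
shard2 = a.increment(3)   # ("A", 1, 3) -- clock 1 reused!

# Replicas now hold two shards that "differ only in count":
assert shard1[:2] == shard2[:2] and shard1[2] != shard2[2]
```

        Once both shards reach a replica, compaction or read reconciliation sees the same (nodeId, clock) pair with different counts and logs the "invalid counter shard detected" error.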
        Peter Schuller added a comment -

        I am not 100% certain, but I am fairly certain, that we've seen this on nodes that haven't done any streaming whatsoever.

        With respect to duplicates: It's certainly not supposed to happen. A given counter shard, from a given node id, and with a given clock, should only ever be produced exactly once by exactly one node. Obviously the bug isn't supposed to happen to begin with, so that doesn't mean the bug isn't related to streaming.

        Hmmm.

        Do you have a lot of writes normally? Is it possible that the correlation with streaming is because of the fact that it initiates significant amounts of compaction?

        Charles Brophy added a comment (edited) -

        We have a six-node cluster [1.1.3, JDK 1.6.33, CentOS 6] with even key range balance, random partitioner, and replication factor 2. I get these errors immediately following running "nodetool repair", but ONLY if a streaming repair happens as a result. We are serving live updates to our counters from our clickstream. My guess is that the sstable being streamed between the servers winds up becoming "out of date" for the duration of the streaming process and ends up containing these "duplicates" that are vetted during the subsequent compaction. In any case, for us it is 100% reproducible via: nodetool repair -> streaming repair -> subsequent compaction. Let me know if you need more details. Hope this helps!

        Michael Theroux added a comment -

        Ignore that comment about row caching. I see these errors in the log dating back to the 11th of July (long before we enabled row caching).

        Michael Theroux added a comment -

        I just hit this myself on 1.1.2 on two nodes of a six node cluster. The cluster has been stable for a couple of weeks.

        If it makes any difference, we recently enabled row-caching.

        ...
        INFO [AntiEntropyStage:1] 2012-07-24 11:05:55,537 AntiEntropyService.java (line 206) repair #b9355020-d57e-11e1-0000-7c4549350fdf Received merkle tree for caches from /10.29.214.111
        ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,532 CounterContext.java (line 381) invalid counter shard detected; (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, -1) and (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
        ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,533 CounterContext.java (line 381) invalid counter shard detected; (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, 1) and (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, -1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
        ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,534 CounterContext.java (line 381) invalid counter shard detected; (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, -1) and (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
        ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,534 CounterContext.java (line 381) invalid counter shard detected; (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, 1) and (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, -1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
        ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,534 CounterContext.java (line 381) invalid counter shard detected; (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, -1) and (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
        ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,535 CounterContext.java (line 381) invalid counter shard detected; (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, 1) and (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, -1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
        INFO [AntiEntropyStage:1] 2012-07-24 11:06:05,541 AntiEntropyService.java (line 206) repair #b9355020-d57e-11e1-0000-7c4549350fdf Received merkle tree for caches from /10.144.15.6
        ...

        Peter Schuller added a comment -

        You are triggering a self-healing mechanism introduced in CASSANDRA-3641. I can confirm we're still seeing it on 1.1 too, even on sstables that haven't been upgraded from older versions. I don't think anyone knows exactly why it's happening.

        The condition being checked for is never supposed to happen. Prior to CASSANDRA-3641, the result would be counters with values that never converge, even with read repair; post-CASSANDRA-3641, the values converge. But depending on the root cause, it's unclear whether there is a danger of incorrect counter values.

        Senthilvel Rangaswamy created issue -

          People

          • Assignee:
            Unassigned
          • Reporter:
            Senthilvel Rangaswamy
          • Votes:
            7
          • Watchers:
            26