Cassandra / CASSANDRA-8819

LOCAL_QUORUM writes return wrong message

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Fix Version/s: 2.0.13
    • Component/s: None
    • Labels:
    • Environment: CentOS 6.6

      Description

      We have two DCs, each with 7 nodes.
      Here is the keyspace setup:

      create keyspace test
        with placement_strategy = 'NetworkTopologyStrategy'
        and strategy_options = {DC2 : 3, DC1 : 3}
        and durable_writes = true;

      We brought down two nodes in DC2 for maintenance. We only write to DC1, at LOCAL_QUORUM, using the DataStax Java client.
      But we see errors like this in the log:
      Cassandra timeout during write query at consistency LOCAL_QUORUM (4 replica were required but only 3 acknowledged the write)
      Why does it say 4 replicas were required? And why does it return an error to the client at all, since LOCAL_QUORUM should succeed?
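
      For reference, our client code boils down to something like the following. This is a minimal sketch assuming the DataStax Java driver 2.0 API; the contact point and the now() timeuuid are placeholders rather than our production code:

      import com.datastax.driver.core.Cluster;
      import com.datastax.driver.core.ConsistencyLevel;
      import com.datastax.driver.core.Session;
      import com.datastax.driver.core.SimpleStatement;
      import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

      public class WriteExample
      {
          public static void main(String[] args)
          {
              // Pin the driver to DC1 so the coordinator is always local.
              Cluster cluster = Cluster.builder()
                      .addContactPoint("10.1.0.1")
                      .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("DC1"))
                      .build();
              Session session = cluster.connect("test");

              // With RF=3 in DC1, a LOCAL_QUORUM write should need only
              // 2 acks from DC1 replicas, whatever the state of DC2.
              SimpleStatement insert = new SimpleStatement(
                      "insert into test (user_id, created, event_data, event_id) " +
                      "values (123456789, now(), 'test', '16')");
              insert.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
              session.execute(insert);

              cluster.close();
          }
      }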

      Here is the output from nodetool status:

      Note: Ownership information does not include topology; for complete information, specify a keyspace
      Datacenter: DC2
      ===============
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address   Load     Tokens  Owns  Host ID                       Rack
      UN 10.2.0.1 10.92 GB 256 7.9% XXXXXXXXXXXXXXXXXXXXXXXXXXXX RAC206
      UN 10.2.0.2 6.17 GB 256 8.0% XXXXXXXXXXXXXXXXXXXXXXXXXXXX RAC106
      UN 10.2.0.3 6.63 GB 256 7.3% XXXXXXXXXXXXXXXXXXXXXXXXXXXX RAC107
      DL 10.2.0.4 1.54 GB 256 7.7% XXXXXXXXXXXXXXXXXXXXXXXXXXXX RAC107
      UN 10.2.0.5 6.02 GB 256 6.6% XXXXXXXXXXXXXXXXXXXXXXXXXXXX RAC106
      UJ 10.2.0.6 3.68 GB 256 ? XXXXXXXXXXXXXXXXXXXXXXXXXXXX RAC205
      UN 10.2.0.7 7.22 GB 256 7.7% XXXXXXXXXXXXXXXXXXXXXXXXXXXX RAC205
      Datacenter: DC1
      ===============
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address   Load     Tokens  Owns  Host ID                       Rack
      UN 10.1.0.1 6.04 GB 256 8.6% XXXXXXXXXXXXXXXXXXXXXXXXXXXX RAC10
      UN 10.1.0.2 7.55 GB 256 7.4% XXXXXXXXXXXXXXXXXXXXXXXXXXXX RAC8
      UN 10.1.0.3 5.83 GB 256 7.0% XXXXXXXXXXXXXXXXXXXXXXXXXXXX RAC9
      UN 10.1.0.4 7.34 GB 256 7.9% XXXXXXXXXXXXXXXXXXXXXXXXXXXX RAC6
      UN 10.1.0.5 7.57 GB 256 8.0% XXXXXXXXXXXXXXXXXXXXXXXXXXXX RAC7
      UN 10.1.0.6 5.31 GB 256 7.3% XXXXXXXXXXXXXXXXXXXXXXXXXXXX RAC10
      UN 10.1.0.7 5.47 GB 256 8.6% XXXXXXXXXXXXXXXXXXXXXXXXXXXX RAC9

      I ran a CQL trace on the query (full trace below); near the end it does say

      Write timeout; received 3 of 4 required replies | 17:27:52,831 | 10.1.0.1 | 2002873

      which I guess is where the client gets the error from. But the row was inserted into Cassandra correctly. I also traced a read at LOCAL_QUORUM and it behaves correctly: reads don't go to DC2. The problem is only with writes at LOCAL_QUORUM.

      Tracing session: 5a789fb0-b70d-11e4-8fca-99bff9c19890
      
       activity                                                                                                                                    | timestamp    | source      | source_elapsed
      ---------------------------------------------------------------------------------------------------------------------------------------------+--------------+-------------+----------------
                                                                                                                                execute_cql3_query | 17:27:50,828 |  10.1.0.1 |              0
       Parsing insert into test (user_id, created, event_data, event_id)values ( 123456789 , 9eab8950-b70c-11e4-8fca-99bff9c19891, 'test', '16'); | 17:27:50,828 |  10.1.0.1 |             39
                                                                                                                               Preparing statement | 17:27:50,828 |  10.1.0.1 |            135
                                                                                                                 Message received from /10.1.0.1 | 17:27:50,829 |  10.1.0.5 |             25
                                                                                                                    Sending message to /10.1.0.5 | 17:27:50,829 |  10.1.0.1 |            421
                                                                                                         Executing single-partition query on users | 17:27:50,829 |  10.1.0.5 |            177
                                                                                                                      Acquiring sstable references | 17:27:50,829 |  10.1.0.5 |            191
                                                                                                                       Merging memtable tombstones | 17:27:50,830 |  10.1.0.5 |            208
                                                                                                                 Message received from /10.1.0.5 | 17:27:50,830 |  10.1.0.1 |           1461
                                                                                                                 Message received from /10.1.0.1 | 17:27:50,830 |  10.1.0.2 |             14
                                                                                                                       Key cache hit for sstable 1 | 17:27:50,830 |  10.1.0.5 |            254
                                                                                                              Processing response from /10.1.0.5 | 17:27:50,830 |  10.1.0.1 |           1514
                                                                                                         Executing single-partition query on users | 17:27:50,830 |  10.1.0.2 |             78
                                                                                                       Seeking to partition beginning in data file | 17:27:50,830 |  10.1.0.5 |            264
                                                                                                                    Sending message to /10.1.0.2 | 17:27:50,830 |  10.1.0.1 |           1517
                                                                                                                      Acquiring sstable references | 17:27:50,830 |  10.1.0.2 |             85
                                                                         Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones | 17:27:50,830 |  10.1.0.5 |            453
                                                                                                                 Determining replicas for mutation | 17:27:50,830 |  10.1.0.1 |           1746
                                                                                                                       Merging memtable tombstones | 17:27:50,830 |  10.1.0.2 |             97
                                                                                                        Merging data from memtables and 1 sstables | 17:27:50,830 |  10.1.0.5 |            476
                                                                                                                 Message received from /10.1.0.2 | 17:27:50,830 |  10.1.0.1 |           2369
                                                                                                                       Key cache hit for sstable 2 | 17:27:50,830 |  10.1.0.2 |            120
                                                                                                                Read 1 live and 0 tombstoned cells | 17:27:50,830 |  10.1.0.5 |            506
                                                                                                       Seeking to partition beginning in data file | 17:27:50,830 |  10.1.0.2 |            123
                                                                                                                 Enqueuing response to /10.1.0.1 | 17:27:50,830 |  10.1.0.5 |            570
                                                                         Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones | 17:27:50,830 |  10.1.0.2 |            288
                                                                                                                    Sending message to /10.1.0.1 | 17:27:50,830 |  10.1.0.5 |            617
                                                                                                        Merging data from memtables and 1 sstables | 17:27:50,830 |  10.1.0.2 |            297
                                                                                                                Read 1 live and 0 tombstoned cells | 17:27:50,830 |  10.1.0.2 |            319
                                                                                                                 Enqueuing response to /10.1.0.1 | 17:27:50,830 |  10.1.0.2 |            362
                                                                                                                    Sending message to /10.1.0.1 | 17:27:50,830 |  10.1.0.2 |            420
                                                                                                                 Message received from /10.1.0.1 | 17:27:50,831 |   10.1.0.4 |             17
                                                                                                                    Sending message to /10.1.0.2 | 17:27:50,831 |  10.1.0.1 |           2435
                                                                                                                 Message received from /10.1.0.1 | 17:27:50,831 |  10.1.0.2 |              8
                                                                                                                    Acquiring switchLock read lock | 17:27:50,831 |   10.1.0.4 |             61
                                                                                                              Processing response from /10.1.0.2 | 17:27:50,831 |  10.1.0.1 |           2488
                                                                                                                    Acquiring switchLock read lock | 17:27:50,831 |  10.1.0.2 |             44
                                                                                                                            Appending to commitlog | 17:27:50,831 |   10.1.0.4 |             78
                                                                                          Not hinting /10.2.0.4 which has been down 364809650ms | 17:27:50,831 |  10.1.0.1 |           2503
                                                                                                                            Appending to commitlog | 17:27:50,831 |  10.1.0.2 |             62
                                                                                                                          Adding to event memtable | 17:27:50,831 |   10.1.0.4 |             96
                                                                                                                     Sending message to /10.1.0.4 | 17:27:50,831 |  10.1.0.1 |           2557
                                                                                                                          Adding to event memtable | 17:27:50,831 |  10.1.0.2 |             80
                                                                                                                 Enqueuing response to /10.1.0.1 | 17:27:50,831 |   10.1.0.4 |            177
                                                                                                                    Acquiring switchLock read lock | 17:27:50,831 |  10.1.0.1 |           2669
                                                                                                                 Enqueuing response to /10.1.0.1 | 17:27:50,831 |  10.1.0.2 |            160
                                                                                                                    Sending message to /10.1.0.1 | 17:27:50,831 |   10.1.0.4 |            333
                                                                                                                            Appending to commitlog | 17:27:50,831 |  10.1.0.1 |           2682
                                                                                                                    Sending message to /10.1.0.1 | 17:27:50,831 |  10.1.0.2 |            266
                                                                                                                          Adding to event memtable | 17:27:50,831 |  10.1.0.1 |           2720
                                                                                                                   Sending message to /10.2.0.5 | 17:27:50,831 |  10.1.0.1 |           2758
                                                                                                                 Message received from /10.1.0.2 | 17:27:50,831 |  10.1.0.1 |           2989
                                                                                                              Processing response from /10.1.0.2 | 17:27:50,831 |  10.1.0.1 |           3024
                                                                                                                  Message received from /10.1.0.4 | 17:27:50,832 |  10.1.0.1 |           3764
                                                                                                               Processing response from /10.1.0.4 | 17:27:50,832 |  10.1.0.1 |           3805
                                                                                                                 Message received from /10.1.0.1 | 17:27:50,841 | 10.2.0.5 |             24
                                                                                                         Enqueuing forwarded write to /10.2.0.7 | 17:27:50,841 | 10.2.0.5 |            255
                                                                                                         Enqueuing forwarded write to /10.2.0.3 | 17:27:50,841 | 10.2.0.5 |            283
                                                                                                          Enqueuing forwarded write to /10.2.0.6 | 17:27:50,841 | 10.2.0.5 |            307
                                                                                                                    Acquiring switchLock read lock | 17:27:50,841 | 10.2.0.5 |            362
                                                                                                                            Appending to commitlog | 17:27:50,841 | 10.2.0.5 |            380
                                                                                                                   Sending message to /10.2.0.7 | 17:27:50,841 | 10.2.0.5 |            382
                                                                                                                          Adding to event memtable | 17:27:50,841 | 10.2.0.5 |            411
                                                                                                                    Sending message to /10.2.0.6 | 17:27:50,841 | 10.2.0.5 |            429
                                                                                                                 Enqueuing response to /10.1.0.1 | 17:27:50,841 | 10.2.0.5 |            484
                                                                                                                   Sending message to /10.2.0.3 | 17:27:50,841 | 10.2.0.5 |            561
                                                                                                                    Sending message to /10.1.0.1 | 17:27:50,841 | 10.2.0.5 |            625
                                                                                                                    Acquiring switchLock read lock | 17:27:50,842 |  10.2.0.6 |           1031
                                                                                                                    Acquiring switchLock read lock | 17:27:50,842 | 10.2.0.7 |            178
                                                                                                                            Appending to commitlog | 17:27:50,842 |  10.2.0.6 |           1066
                                                                                                                            Appending to commitlog | 17:27:50,842 | 10.2.0.7 |            196
                                                                                                                          Adding to event memtable | 17:27:50,842 |  10.2.0.6 |           1118
                                                                                                                          Adding to event memtable | 17:27:50,842 | 10.2.0.7 |            231
                                                                                                                 Enqueuing response to /10.1.0.1 | 17:27:50,842 |  10.2.0.6 |           1181
                                                                                                                 Enqueuing response to /10.1.0.1 | 17:27:50,842 | 10.2.0.7 |            286
                                                                                                                    Sending message to /10.1.0.1 | 17:27:50,842 | 10.2.0.7 |            421
                                                                                                                    Acquiring switchLock read lock | 17:27:50,843 | 10.2.0.3 |           1216
                                                                                                                    Sending message to /10.1.0.1 | 17:27:50,843 |  10.2.0.6 |           1382
                                                                                                                            Appending to commitlog | 17:27:50,843 | 10.2.0.3 |           1239
                                                                                                                          Adding to event memtable | 17:27:50,843 | 10.2.0.3 |           1313
                                                                                                                 Enqueuing response to /10.1.0.1 | 17:27:50,843 | 10.2.0.3 |           1407
                                                                                                                    Sending message to /10.1.0.1 | 17:27:50,843 | 10.2.0.3 |           1631
                                                                                                                Message received from /10.2.0.5 | 17:27:50,851 |  10.1.0.1 |          23333
                                                                                                             Processing response from /10.2.0.5 | 17:27:50,851 |  10.1.0.1 |          23380
                                                                                                                Message received from /10.2.0.7 | 17:27:50,852 |  10.1.0.1 |          23908
                                                                                                             Processing response from /10.2.0.7 | 17:27:50,852 |  10.1.0.1 |          23953
                                                                                                                 Message received from /10.2.0.6 | 17:27:50,853 |  10.1.0.1 |          25143
                                                                                                              Processing response from /10.2.0.6 | 17:27:50,853 |  10.1.0.1 |          25178
                                                                                                                Message received from /10.2.0.3 | 17:27:50,854 |  10.1.0.1 |          25478
                                                                                                             Processing response from /10.2.0.3 | 17:27:50,854 |  10.1.0.1 |          25516
                                                                                                   Write timeout; received 3 of 4 required replies | 17:27:52,831 |  10.1.0.1 |        2002873
                                                                                                                                  Request complete | 17:27:52,833 |  10.1.0.1 |        2005989
      

      cqlsh:XXXX> CONSISTENCY
      Current consistency level is LOCAL_QUORUM.

      Attachments

        • 8819-2.0.patch (2 kB, Sylvain Lebresne)


          Activity

          Philip Thompson added a comment -

          Any ideas what this could be, Tyler? We discussed it on IRC today. It doesn't appear to be CASSANDRA-7947, and Wei says they aren't using LWT. All nodes agree on the schema.

          Philip Thompson added a comment - edited

          Wei Zhu, what exactly is the query you are executing?

          EDIT: Nevermind, it is in the trace.

          Philip Thompson added a comment -

          What is the schema for your table? I'm assuming:

          CREATE TABLE test (
            user_id int,
            created timeuuid,
            event_data text,
            event_id text,
            PRIMARY KEY ((user_id), created)
          );
          Philip Thompson added a comment - edited

          I have sanity-checked this: a six-node, two-DC cluster on 2.0.8 works as expected for me locally.

          Wei can reproduce it in cqlsh and through the Java driver, so it isn't some sort of driver bug. If all of the nodes agree on the schema and on the nodetool status info, I'm stumped as to what could be going on.

          Sylvain Lebresne added a comment -

          There are both a joining and a leaving node in DC2, so what probably happens is that for some writes there are 2 "pending endpoints". During a write, if there is a joining node and that node is a future replica for the write (no matter its datacenter), we add it to the number of nodes waited for (thus temporarily waiting for more nodes than strictly required by the consistency level) to avoid potentially violating CL guarantees post-bootstrap; see CASSANDRA-833. I'll note that this explains the error message, and the error message is not really the problem here: Cassandra does sometimes wait for more nodes than the CL initially suggests, and that is not a bug.
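
          In rough pseudocode, the accounting amounts to the following. This is an illustrative sketch, not the actual AbstractWriteResponseHandler code:

          public final class BlockForSketch
          {
              // Sketch: the coordinator blocks for a local quorum plus every
              // pending endpoint, regardless of that endpoint's datacenter.
              static int totalBlockFor(int localReplicationFactor, int pendingEndpoints)
              {
                  int localQuorum = localReplicationFactor / 2 + 1; // RF=3 -> 2
                  // Here DC2 has one joining and one leaving node (2 pending
                  // endpoints), so the write blocks for 2 + 2 = 4 acks, which is
                  // exactly the "4 replica were required" in the error message.
                  return localQuorum + pendingEndpoints;
              }
          }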

          What is a bug, though, is that this code is broken for LOCAL_QUORUM writes: while we can end up waiting for more nodes than there are replicas in the local datacenter, DatacenterWriteResponseHandler ignores acks from nodes in other DCs, so a timeout is guaranteed in that situation (even though, in the example, there should be at least 1 replica up in DC2, so we most likely did get the proper number of answers; we just ignored one).

          We could fix DatacenterWriteResponseHandler to count some responses from remote DCs when there are pending endpoints (as long as we still fulfill a local quorum, of course), but the bigger question is whether we should include non-local pending endpoints in AbstractWriteResponseHandler.totalBlockFor in the first place. In theory, counting them as we do is the correct thing: CL guarantees are based on the combination of the write and read CLs, and we don't know what CL the user will use on read. But in practice, using LOCAL_QUORUM means you only care about local-DC guarantees, so I think it would be fine, and a lot more practical, to exclude non-local endpoints.
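
          Concretely, the filtering could look something like this. It is a simplified sketch of the idea rather than the attached patch; IEndpointSnitch.getDatacenter() is the real API, but the surrounding class and method names are illustrative:

          import java.net.InetAddress;
          import java.util.ArrayList;
          import java.util.Collection;
          import java.util.List;

          import org.apache.cassandra.locator.IEndpointSnitch;

          public final class PendingEndpointFilter
          {
              // For a datacenter-local CL, only pending endpoints in the
              // coordinator's own DC should count towards totalBlockFor.
              public static Collection<InetAddress> localPendingEndpoints(
                      Collection<InetAddress> pending, IEndpointSnitch snitch, String localDc)
              {
                  List<InetAddress> local = new ArrayList<InetAddress>();
                  for (InetAddress endpoint : pending)
                  {
                      // The joining/leaving DC2 nodes are skipped here, so a
                      // LOCAL_QUORUM write in DC1 is back to blocking for 2 acks.
                      if (localDc.equals(snitch.getDatacenter(endpoint)))
                          local.add(endpoint);
                  }
                  return local;
              }
          }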

          Thoughts (Jonathan Ellis in particular, since you were the author of CASSANDRA-833)?

          Wei Zhu added a comment -

          For us, the reason we set the consistency level to LOCAL_QUORUM is that we don't require (immediate) consistency in DC2, which in our case is a backup DC. We were under the assumption that we could do whatever we want to DC2 as long as the consistency level is LOCAL_QUORUM. I totally agree that non-local endpoints should be excluded for LOCAL_QUORUM, even for bootstrapping nodes.

          Jonathan Ellis added a comment -

          "in practice, using LOCAL_QUORUM means you only care about local DC guarantees and so I think it would be fine and a lot more practical to exclude non-local endpoints"

          I agree. By using a LOCAL CL you're signaling that you are fine with repairing to achieve consistency in remote DCs if necessary.

          Sylvain Lebresne added a comment -

          Alright then, attaching a relatively simple patch to filter non-local pending endpoints for LOCAL_QUORUM. It should be possible to write a dtest for this, but that's probably a bit involved. The simplest setup is probably 2 DCs with RF=1 in each, queries in DC1, and one joining node in DC2 (but you'll still want to query multiple keys to be sure the joining node is pending for the queried token, and of course you need to make sure the queries are done while the node is bootstrapping). Philip Thompson, want to give such a dtest a shot when you have some time?
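
          The client side of such a dtest boils down to something like the following rough sketch with the DataStax Java driver; the dtest harness would drive the cluster setup and the bootstrap, and the contact point and table are placeholders:

          import com.datastax.driver.core.Cluster;
          import com.datastax.driver.core.ConsistencyLevel;
          import com.datastax.driver.core.Session;
          import com.datastax.driver.core.SimpleStatement;
          import com.datastax.driver.core.exceptions.WriteTimeoutException;

          public class LocalQuorumPendingTest
          {
              public static void main(String[] args)
              {
                  Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                  Session session = cluster.connect("test");

                  // Write many keys so at least some land in the joining DC2
                  // node's pending token ranges while it is bootstrapping.
                  for (int key = 0; key < 1000; key++)
                  {
                      SimpleStatement insert = new SimpleStatement(
                              "insert into test (user_id, created, event_data, event_id) " +
                              "values (" + key + ", now(), 'data', '1')");
                      insert.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
                      try
                      {
                          session.execute(insert);
                      }
                      catch (WriteTimeoutException e)
                      {
                          // With the patch, no LOCAL_QUORUM write should time out.
                          throw new AssertionError("write timed out for key " + key, e);
                      }
                  }
                  cluster.close();
              }
          }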

          Philip Thompson added a comment -

          Yes, I should be able to write up a dtest for this.

          Tyler Hobbs added a comment -

          Hmm, the changes from CASSANDRA-8058 already fixed the totalBlockFor calculation for DatacenterWriteResponseHandler (although the commit is a bit different/cleaner than the original patch). Since this was seen on 2.0.8, I think this ticket is just a dupe.

          Regardless, we could still use a dtest for this.

          Sylvain Lebresne added a comment -

          OK, marking as duplicate since it is, but Philip Thompson, feel free to leave "making a dtest for this" on your todo list (and feel free to ping me if you have trouble reproducing).

          Alan Boudreault added a comment -

          FYI, a dtest has been created for this ticket: https://github.com/riptano/cassandra-dtest/pull/194 . It is not committed yet but will be soon.

          Alan Boudreault added a comment -

          Committed!


            People

            • Assignee: Sylvain Lebresne
            • Reporter: Wei Zhu
            • Reviewer: Tyler Hobbs
            • Tester: Alan Boudreault
            • Votes: 0
            • Watchers: 6
