Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-17411

Network partition causes write ONE timeouts when using counters in Cassandra 4

    XMLWordPrintableJSON

Details

    Description

      Affected versions:
      • 4.x
      Observed behavior:

      When executing CL=ONE writes on a table with a counter column, if one of the nodes is network partitioned from the others, clients keep sending requests to it.

      Even though this may be a "driver" problem, I've been able to reproduce it with both java and python datastax drivers using their latest available versions and given the behavior only changes depending on the Cassandra version, well, here I am.

      Expected behavior:

      In Cassandra 3 after all inflight requests fail (expected), no new requests are sent to the partitioned node. The expectation is that Cassandra 4 behaves the same way.

      How to reproduce:
      # Create a cluster with the desired version, will go with 4.x for this example
      ccm create bug-report -v 4.0.3
      ccm populate -n 2:2:2
      ccm start
      
      # Create schemas and so on
      CQL=$(cat <<END
      CONSISTENCY ALL;
      DROP KEYSPACE IF EXISTS demo;
      CREATE KEYSPACE demo WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'dc1': 2, 'dc2': 2, 'dc3': 2};
      CREATE TABLE demo.demo (pk uuid PRIMARY KEY, count counter) WITH compaction = {'class': 'LeveledCompactionStrategy'};
      END
      )
      ccm node1 cqlsh --verbose --exec="${CQL}"
      
      # Launch the attached app.py
      # requires cassandra-driver
      python3 app.py "127.0.0.1" "9042"
      
      # Wait a bit for the app to settle, proceed to next step once you see 3 messages in stdout like:
      # 2022-03-01 15:41:51,557 - target-dc2 - __main__ - INFO - Got 0/572 (0.000000) timeouts/total_rqs in the last 1 minute
      
      # Partition one node with iptables
      iptables -A INPUT -p tcp --destination 127.0.0.1 --destination-port 7000 -j DROP; iptables -A INPUT -p tcp --destination 127.0.0.1 --destination-port 9042 -j DROP
      

      Some time after executing the iptables command in cassandra-3 the output should be similar to:

      2022-03-01 15:41:51,557 - target-dc2 - __main__ - INFO - Got 0/572 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:41:51,576 - target-dc3 - __main__ - INFO - Got 0/572 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:41:58,032 - target-dc1 - __main__ - INFO - Got 6/252 (2.380952) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:42:51,560 - target-dc2 - __main__ - INFO - Got 0/570 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:42:51,620 - target-dc3 - __main__ - INFO - Got 0/570 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:42:58,101 - target-dc1 - __main__ - INFO - Got 2/354 (0.564972) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:43:51,602 - target-dc2 - __main__ - INFO - Got 0/571 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:43:51,672 - target-dc3 - __main__ - INFO - Got 0/571 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:43:58,153 - target-dc1 - __main__ - INFO - Got 0/572 (0.000000) timeouts/total_rqs in the last 1 minute
      

      as the timeouts/rqs shows, in about 2 minutes the partitioned node stops receiving traffic

      while as in cassandra-4

      2022-03-01 15:49:39,068 - target-dc3 - __main__ - INFO - Got 0/566 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:49:39,107 - target-dc2 - __main__ - INFO - Got 0/566 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:49:41,206 - target-dc1 - __main__ - INFO - Got 2/444 (0.450450) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:50:39,095 - target-dc3 - __main__ - INFO - Got 0/569 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:50:39,148 - target-dc2 - __main__ - INFO - Got 0/569 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:50:42,589 - target-dc1 - __main__ - INFO - Got 7/13 (53.846154) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:51:39,125 - target-dc3 - __main__ - INFO - Got 0/567 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:51:39,159 - target-dc2 - __main__ - INFO - Got 0/567 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:51:44,902 - target-dc1 - __main__ - INFO - Got 12/22 (54.545455) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:52:39,188 - target-dc2 - __main__ - INFO - Got 0/567 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:52:39,202 - target-dc3 - __main__ - INFO - Got 0/568 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:52:46,994 - target-dc1 - __main__ - INFO - Got 12/20 (60.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:53:39,202 - target-dc2 - __main__ - INFO - Got 0/568 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:53:39,238 - target-dc3 - __main__ - INFO - Got 0/568 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:53:48,765 - target-dc1 - __main__ - INFO - Got 12/17 (70.588235) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:54:39,203 - target-dc2 - __main__ - INFO - Got 0/568 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:54:39,291 - target-dc3 - __main__ - INFO - Got 0/568 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:54:51,291 - target-dc1 - __main__ - INFO - Got 12/24 (50.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:55:39,258 - target-dc2 - __main__ - INFO - Got 0/570 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:55:39,327 - target-dc3 - __main__ - INFO - Got 0/570 (0.000000) timeouts/total_rqs in the last 1 minute
      2022-03-01 15:55:53,365 - target-dc1 - __main__ - INFO - Got 12/20 (60.000000) timeouts/total_rqs in the last 1 minute
      

      as the timeouts/rqs shows, the partitioned node keeps receving traffic until the iptables rules are removed

      Attachments

        1. app.py
          3 kB
          Pere Balaguer

        Activity

          People

            brandon.williams Brandon Williams
            pbalaguer Pere Balaguer
            Brandon Williams
            Andres de la Peña
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: