CASSANDRA-16752: Python dtest queries executed after restarting a node might fail

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • Fix Version/s: 4.0-rc2, 4.0
    • Component/s: Test/dtest/python
    • Labels: None

    Description

      We have multiple dtest failures in queries with CL=ALL that are executed right after restarting a node.

      The error is a server-side failure indicating that there aren't enough alive replicas to achieve the required consistency level:

      tools/assertions.py:182: in assert_all
          res = session.execute(simple_query) if timeout is None else session.execute(simple_query, timeout=timeout)
      ../venv/src/cassandra-driver/cassandra/cluster.py:2618: in execute
          return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state, host, execute_as).result()
      ...
      E           cassandra.Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level ALL" info={'consistency': 'ALL', 'required_replicas': 2, 'alive_replicas': 1}
      

      The Jenkins failure in test_14330 can be reproduced in the multiplexer, although it requires many iterations.

      As a hint, this same test survives the multiplexer if we add a short sleep right after restarting the node.
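
      For reference, the failing pattern essentially boils down to a sketch like the following (a hypothetical dtest; the keyspace, table and RF=2 setup are illustrative and not taken from any particular failing test):

      from cassandra import ConsistencyLevel
      from cassandra.query import SimpleStatement

      from dtest import Tester


      class TestReadAllAfterRestart(Tester):

          def test_read_all_after_restart(self):
              # Two-node cluster with RF=2, so a CL=ALL read needs both replicas.
              cluster = self.cluster
              cluster.populate(2).start()
              node1, node2 = cluster.nodelist()

              session = self.patient_cql_connection(node1)
              session.execute("CREATE KEYSPACE ks WITH replication = "
                              "{'class': 'SimpleStrategy', 'replication_factor': 2}")
              session.execute("CREATE TABLE ks.t (k int PRIMARY KEY, v int)")

              # Restart the second node; wait_other_notice is supposed to block
              # until the other nodes have noticed it coming back up.
              node2.stop(wait_other_notice=True)
              node2.start(wait_other_notice=True, wait_for_binary_proto=True)

              # If node1 hasn't marked node2 alive yet, this read fails with
              # Unavailable: "Cannot achieve consistency level ALL".
              query = SimpleStatement("SELECT * FROM ks.t",
                                      consistency_level=ConsistencyLevel.ALL)
              session.execute(query)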

      I think that the reason for the failure is that CCM is not properly waiting for the propagation of the node status. So the session uses a node that wasn't restarted to run the query with CL=ALL before that node actually sees the restarted node as alive. The wait_other_notice parameter in node.start should guarantee that this doesn't happen, but I suspect that it isn't working as expected. When the wait_other_notice option is used on node start, the node.watch_log_for_alive function looks for the regexp %s.* now UP in the logs of the other nodes, to ensure that they consider that node as up. However, that expression is matched by two different log entries emitted on node restart:

      Node /127.0.0.2:7000 has restarted, now UP
      ...
      Node /127.0.0.2:7000 is now UP
      

      The first message is emitted by Gossiper.handleMajorStateChange, while the second message is emitted by Gossiper.realMarkAlive. I think that we need to wait for the second one, while CCM is waiting only for the first one.
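
      Substituting the endpoint for %s makes the overlap easy to see: the broad pattern matches both entries, while requiring "is now UP" matches only the Gossiper.realMarkAlive message. A quick standalone check (not CCM code):

      import re

      log_lines = [
          "Node /127.0.0.2:7000 has restarted, now UP",
          "Node /127.0.0.2:7000 is now UP",
      ]

      # Shape of the pattern used by watch_log_for_alive, with %s replaced by the endpoint.
      broad = re.compile("127.0.0.2.* now UP")
      # Stricter pattern that only the Gossiper.realMarkAlive message satisfies.
      strict = re.compile("127.0.0.2.* is now UP")

      print([bool(broad.search(line)) for line in log_lines])   # [True, True]
      print([bool(strict.search(line)) for line in log_lines])  # [False, True]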

      Indeed, we can easily reproduce the failure in any of the affected dtests by adding a sleep at the beginning of Gossiper.realMarkAlive. We can get around the failure if we manually modify the test to wait for the right log message, as it's done here. With that wait the test survives the multiplexer rounds.
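
      A sketch of what that manual wait can look like, using ccmlib's mark_log()/watch_log_for() and assuming the cluster/node2 variables from a test like the one above (the 120s timeout is arbitrary):

      # Mark the logs of the other nodes before restarting node2, then wait for
      # the Gossiper.realMarkAlive message instead of the broader "now UP" match.
      other_nodes = [n for n in cluster.nodelist() if n is not node2]
      marks = [(n, n.mark_log()) for n in other_nodes]

      node2.stop(wait_other_notice=True)
      node2.start(wait_for_binary_proto=True)

      for node, mark in marks:
          node.watch_log_for("%s.* is now UP" % node2.address(),
                             from_mark=mark, timeout=120)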

      This problem seems quite unlikely to happen since the time window is very short, but we have seen it in CI and it affects multiple Python dtests.


People

    Assignee: Andres de la Peña
    Reporter: Andres de la Peña
    Authors: Andres de la Peña
    Reviewers: Brandon Williams
