CASSANDRA-16752: Python dtest queries executed after restarting a node might fail

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • Fix Version/s: 4.0-rc2, 4.0
    • Component/s: Test/dtest/python
    • Labels: None

    Description

      We have multiple dtest failures in queries with CL=ALL that are executed right after restarting a node.

      The error is a server-side failure indicating that there aren't enough alive replicas to achieve the required consistency level:

      tools/assertions.py:182: in assert_all
          res = session.execute(simple_query) if timeout is None else session.execute(simple_query, timeout=timeout)
      ../venv/src/cassandra-driver/cassandra/cluster.py:2618: in execute
          return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state, host, execute_as).result()
      ...
      E           cassandra.Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level ALL" info={'consistency': 'ALL', 'required_replicas': 2, 'alive_replicas': 1}
      

      The Jenkins failure in test_14330 can be reproduced in the multiplexer, although it requires many iterations.

      As a hint, this same test survives the multiplexer if we add a short sleep right after restarting the node.
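
      For reference, the failing pattern essentially boils down to a sketch like the following (a hypothetical dtest; the keyspace, table and RF=2 setup are illustrative and not taken from any particular failing test):

      from cassandra import ConsistencyLevel
      from cassandra.query import SimpleStatement

      from dtest import Tester


      class TestReadAllAfterRestart(Tester):

          def test_read_all_after_restart(self):
              # Two-node cluster with RF=2, so a CL=ALL read needs both replicas.
              cluster = self.cluster
              cluster.populate(2).start()
              node1, node2 = cluster.nodelist()

              session = self.patient_cql_connection(node1)
              session.execute("CREATE KEYSPACE ks WITH replication = "
                              "{'class': 'SimpleStrategy', 'replication_factor': 2}")
              session.execute("CREATE TABLE ks.t (k int PRIMARY KEY, v int)")

              # Restart the second node; wait_other_notice is supposed to block
              # until the other nodes have noticed it coming back up.
              node2.stop(wait_other_notice=True)
              node2.start(wait_other_notice=True, wait_for_binary_proto=True)

              # If node1 hasn't marked node2 alive yet, this read fails with
              # Unavailable: "Cannot achieve consistency level ALL".
              query = SimpleStatement("SELECT * FROM ks.t",
                                      consistency_level=ConsistencyLevel.ALL)
              session.execute(query)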

      I think that the reason for the failure is that CCM is not properly waiting for the propagation of the node status. So the session uses a node that wasn't restarted to run the query with CL=ALL before that node actually sees the restarted node as alive. The wait_other_notice parameter in node.start should guarantee that this doesn't happen, but I suspect that it isn't working as expected. When the wait_other_notice option is used on node start, the node.watch_log_for_alive function looks for the regexp %s.* now UP in the logs of the other nodes, to ensure that they consider that node as up. However, that expression is matched by two different log entries emitted on node restart:

      Node /127.0.0.2:7000 has restarted, now UP
      ...
      Node /127.0.0.2:7000 is now UP
      

      The first message is emitted by Gossiper.handleMajorStateChange, while the second message is emitted by Gossiper.realMarkAlive. I think that we need to wait for the second one, while CCM is waiting only for the first one.
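
      Substituting the endpoint for %s makes the overlap easy to see: the broad pattern matches both entries, while requiring "is now UP" matches only the Gossiper.realMarkAlive message. A quick standalone check (not CCM code):

      import re

      log_lines = [
          "Node /127.0.0.2:7000 has restarted, now UP",
          "Node /127.0.0.2:7000 is now UP",
      ]

      # Shape of the pattern used by watch_log_for_alive, with %s replaced by the endpoint.
      broad = re.compile("127.0.0.2.* now UP")
      # Stricter pattern that only the Gossiper.realMarkAlive message satisfies.
      strict = re.compile("127.0.0.2.* is now UP")

      print([bool(broad.search(line)) for line in log_lines])   # [True, True]
      print([bool(strict.search(line)) for line in log_lines])  # [False, True]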

      Indeed, we can easily reproduce the failure in any of the affected dtests by adding a sleep at the beginning of Gossiper.realMarkAlive. We can get around the failure if we manually modify the test to wait for the right log message, as it's done here. With that wait the test survives the multiplexer rounds.
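
      A sketch of what that manual wait can look like, using ccmlib's mark_log()/watch_log_for() and assuming the cluster/node2 variables from a test like the one above (the 120s timeout is arbitrary):

      # Mark the logs of the other nodes before restarting node2, then wait for
      # the Gossiper.realMarkAlive message instead of the broader "now UP" match.
      other_nodes = [n for n in cluster.nodelist() if n is not node2]
      marks = [(n, n.mark_log()) for n in other_nodes]

      node2.stop(wait_other_notice=True)
      node2.start(wait_for_binary_proto=True)

      for node, mark in marks:
          node.watch_log_for("%s.* is now UP" % node2.address(),
                             from_mark=mark, timeout=120)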

      This problem seems quite unlikely to happen since the time window is very short, but we have seen it in CI and it affects multiple Python dtests.


People

    Assignee: Andres de la Peña
    Reporter: Andres de la Peña
    Authors: Andres de la Peña
    Reviewers: Brandon Williams
