Apache Cassandra
CASSANDRA-19097

Test Failure: bootstrap_test.TestBootstrap.*

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • Fix Version/s: 4.0.12, 4.1.4, 5.0-rc1, 5.0, 5.1
    • Component/s: CI
    • Labels: None

    Description

      test_killed_wiped_node_cannot_join
      test_read_from_bootstrapped_node
      test_shutdown_wiped_node_cannot_join

      Seen in dtests_offheap in CASSANDRA-19034
      https://app.circleci.com/pipelines/github/michaelsembwever/cassandra/258/workflows/cea7d697-ca33-40bb-8914-fb9fc662820a/jobs/21162/parallel-runs/38

      self = <bootstrap_test.TestBootstrap object at 0x7fc471171d50>
      
          def test_killed_wiped_node_cannot_join(self):
      >       self._wiped_node_cannot_join_test(gently=False)
      
      bootstrap_test.py:608: 
      _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
      
      self = <bootstrap_test.TestBootstrap object at 0x7fc471171d50>, gently = False
      
          def _wiped_node_cannot_join_test(self, gently):
              """
              @jira_ticket CASSANDRA-9765
              Test that if we stop a node and wipe its data then the node cannot join
              when it is not a seed. Test both a nice shutdown or a forced shutdown, via
              the gently parameter.
              """
              cluster = self.cluster
              cluster.set_environment_variable('CASSANDRA_TOKEN_PREGENERATION_DISABLED', 'True')
              cluster.populate(3)
              cluster.start()
          
              stress_table = 'keyspace1.standard1'
          
              # write some data
              node1 = cluster.nodelist()[0]
              node1.stress(['write', 'n=10K', 'no-warmup', '-rate', 'threads=8'])
          
              session = self.patient_cql_connection(node1)
              original_rows = list(session.execute("SELECT * FROM {}".format(stress_table,)))
          
              # Add a new node, bootstrap=True ensures that it is not a seed
              node4 = new_node(cluster, bootstrap=True)
              node4.start(wait_for_binary_proto=True)
          
              session = self.patient_cql_connection(node4)
      >       assert original_rows == list(session.execute("SELECT * FROM {}".format(stress_table,)))
      E       assert [Row(key=b'PP...e9\xbb'), ...] == [Row(key=b'PP...e9\xbb'), ...]
      E         At index 587 diff: Row(key=b'OP2656L630', C0=b"E02\xd2\x8clBv\tr\n\xe3\x01\xdd\xf2\x8a\x91\x7f-\x9dm'\xa5\xe7PH\xef\xc1xlO\xab+d", C1=b"\xb2\xc0j\xff\xcb'\xe3\xcc\x0b\x93?\x18@\xc4\xc7tV\xb7q\xeeF\x82\xa4\xd3\xdcFl\xd9\x87 \x9a\xde\xdc\xa3", C2=b'\xed\xf8\x8d%\xa4\xa6LPs;\x98f\xdb\xca\x913\xba{M\x8d6XW\x01\xea-\xb5<J\x1eo\xa0F\xbe', C3=b'\x9ec\xcf\xc7\xec\xa5\x85Z]\xa6\x19\xeb\xc4W\x1d%lyZj\xb9\x94I\x90\xebZ\xdba\xdd\xdc\x9e\x82\x95\x1c', C4=b'\xab\x9e\x13\x8b\xc6\x15D\x9b\xccl\xdcX\xb23\xd0\x8b\xa3\xba7\xc1c\xf7F\x1d\xf8e\xbd\x89\xcb\xd8\xd1)f\xdd') != Row(key=b'4LN78NONP0', C0=b"\xdf\x90\xb3/u\xc9/C\xcdOYG3\x070@#\xc3k\xaa$M'\x19\xfb\xab\xc0\x10]\xa6\xac\x1d\x81\xad", C1=b'\x8a\xb7j\x95\xf9\xbd?&\x11\xaaH\xcd\x87\xaa\xd2\x85\x08X\xea9\x94\xae8U\x92\xad\xb0\x1b9\xff\x87Z\xe81', C2=b'6\x1d\xa1-\xf77\xc7\xde+`\xb7\x89\xaa\xcd\xb5_\xe5\xb3\x04\xc7\xb1\x95e\x81s\t1\x8b\x16sc\x0eMm', C3=b'\xfbi\x08;\xc9\x94\x15}r\xfe\x1b\xae5\xf6v\x83\xae\xff\x82\x9b`J\xc2D\xa6k+\xf3\xd3\xff{C\xd0;', C4=b'\x8f\x87\x18\x0f\xfa\xadK"\x9e\x96\x87:tiu\xa5\x99\xe1_Ax\xa3\x12\xb4Z\xc9v\xa5\xad\xb8{\xc0\xa3\x93')
      E         Left contains 2830 more items, first extra item: Row(key=b'5N7N172K30', C0=b'Y\x81\xa6\x02\x89\xa0hyp\x00O\xe9kFp$\x86u\xea\n\x7fK\x99\xe1\xf6G\xf77\xf7\xd7\xe1\xc7L\x...0\x87a\x03\xee', C4=b'\xe8\xd8\x17\xf3\x14\x16Q\x9d\\jb\xde=\x81\xc1B\x9c;T\xb1\xa2O-\x87zF=\x04`\x04\xbd\xc9\x95\xad')
      E         Full diff:
      E           [
      …
      

      Attachments

        1. jenkinslogs.zip
          692 kB
          Berenguer Blasi

        Activity

          jlewandowski Jacek Lewandowski added a comment - Let me add to that: test_shutdown_wiped_node_cannot_join https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1119/workflows/932a44e3-c25e-4c5e-b677-66579380faeb/jobs/50782/tests
          jlewandowski Jacek Lewandowski added a comment - Those tests seem to fail not only on 5.0; I can see them failing on 4.0 as well, for example here: https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1125/workflows/a199ad8c-1a83-43c9-9218-a90c3472d113/jobs/51601 or https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1125/workflows/a199ad8c-1a83-43c9-9218-a90c3472d113/jobs/51579 . The patch which was tested there is very unlikely to be the cause.

          mck Michael Semb Wever added a comment - There were dtest changes with TCM too. Both those runs are recent.

          brandon.williams Brandon Williams added a comment - I haven't been able to reproduce these locally, and here are 500 runs on both j11 and j17.
          bereng Berenguer Blasi added a comment - Seen here https://ci-cassandra.apache.org/job/Cassandra-5.0/138/testReport/junit/dtest-upgrade.upgrade_tests.bootstrap_upgrade_test/TestBootstrapUpgrade/test_decommissioned_wiped_node_can_join/
          paulo Paulo Motta added a comment -

          Seen test_read_from_bootstrapped_node failure in 5.0-18999-j17


          bereng Berenguer Blasi added a comment - Confirmed it happens across all versions. This is a recent 4.1 one but I haven't managed to repro yet...
          bereng Berenguer Blasi added a comment - - edited

          Writes and reads are happening at LOCAL_ONE on those tests. These are 4-node tests, so switching reads to CL.ALL is the best bet so far. We'll only know after the merge as I didn't manage to repro. If we're happy with the suggested PR I'd say we merge and keep an eye on them.

          All failures, both for upgrade and non-upgrade tests (which are all based off the same base class), match the test methods I had to touch. I ran a 500-repeat CI run for one of the test methods to spare CI cycles; all other methods are fixed in the same way.

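          (For reference, a minimal sketch of what switching these reads to CL.ALL amounts to in a dtest, assuming the python-driver session the tests already use; the helper name is hypothetical and the committed PR may differ in detail.)

              from cassandra import ConsistencyLevel
              from cassandra.query import SimpleStatement

              def read_rows_at_all(session, table):
                  # Read every row back at CL.ALL instead of the driver default (LOCAL_ONE)
                  stmt = SimpleStatement("SELECT * FROM {}".format(table),
                                         consistency_level=ConsistencyLevel.ALL)
                  return list(session.execute(stmt))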

          brandon.williams Brandon Williams added a comment - We have approximately zero traction without a reproduction, so +1 to the PR, let's see how it goes.

          bereng Berenguer Blasi added a comment - Tested locally, the test passes 4.0->trunk, and committed. Thx for the review!

          brandon.williams Brandon Williams added a comment -

          Writes and reads are happening at LOCAL_ONE on those tests.

          Hmm, I think that was probably intentional to test the consistency of the newly added nodes, and reading at ALL will hide problems there. So I think this is useful to see if the issue persists since we cannot reproduce, but I don't think we should leave this change in permanently.

          bereng Berenguer Blasi added a comment - ^I see what you mean, I think you're right. We could be having range movement/streaming problems that would get hidden. I'll add a note to myself to check for failures, say, next Tuesday, as these errors happened often. Then I'll revert and on the next PR add a note explaining what you just said.
          bereng Berenguer Blasi added a comment - - edited

          I couldn't find the exact dtest SHA this run used. But if the circle times are correct, this shows a failure even after reading at ALL. So it seems we need to revert this and that we have a legitimate bootstrap problem.


          brandon.williams Brandon Williams added a comment -

          But if the circle times are correct, this shows a failure even after reading at ALL.

          If that is true, the corruption runs deeper than just bootstrap: it affects the whole cluster if ALL doesn't help. Let's stick with your plan of waiting until Tuesday and see if this persists or anything else shakes out.

          bereng Berenguer Blasi added a comment - ^Yeah, as I review things and as Jenkins runs some more SHAs, this coming up as often as it does should give us some idea of the impact of reading at ALL...

          brandon.williams Brandon Williams added a comment - Reopened to make sure we don't forget.
          bereng Berenguer Blasi added a comment - - edited

          Another one that failed with the ALL read. Confirmed, as the failure report prints the new code: https://ci-cassandra.apache.org/job/Cassandra-5.0-dtest-offheap/152/jdk=jdk_11_latest,label=cassandra-dtest,split=13/consoleFull


          brandon.williams Brandon Williams added a comment -

          That really complicates things. This means that even if we have some bug in streaming causing the corruption, we also have a second bug where the repair mechanism of ALL does not work. I want to believe this is something silly yet fundamental, but all the evidence just keeps mounting.

          I don't know that it will help (and I'm not sure what will), but maybe if we modify the test to tell us how wrong the results are, that will be useful. Currently we know something is wrong, but it's a sea of text.
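          (A possible shape for that diagnostic, as a hedged sketch: the helper below is hypothetical, not committed test code, and only summarises how far apart the two result sets are by partition key.)

              def summarize_row_diff(expected_rows, actual_rows):
                  # Compare by partition key so an assertion failure can report counts
                  # instead of dumping thousands of raw Row objects.
                  expected = {row.key for row in expected_rows}
                  actual = {row.key for row in actual_rows}
                  missing = expected - actual   # written but not returned by the new node
                  extra = actual - expected     # returned but never written
                  return ("expected={} actual={} missing={} extra={}".format(
                          len(expected), len(actual), len(missing), len(extra)))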

          bereng Berenguer Blasi added a comment - What I have seen so far, perusing the 4 nodes' logs, is that all writes went to node 1. No data reaches nodes 2 or 3. Then node 4 comes up and streams only from node 1. Then node4 is queried and it fails. The token/ring recalculation traces are impossible to follow. I need to check now against a working example set of logs; let's see if I can follow 8 log files lol. Not reverting yet #justfyi
          bereng Berenguer Blasi added a comment - - edited

          We seem to have a legit problem. I have found instances in butler from 4.0 and 4.1, as was already mentioned in the ticket. I have amended fixVersions accordingly.

          The problem seems to be a race revolving around nodes coming up, gossiper, StorageService and TokenMetadata updates, recalculations, state changes, caches, token collisions, etc. The logs are too sparse to see in detail what is going on; maybe somebody with deep knowledge of this area could pin it down, but without a reproduction this is hard.

          I have attached logs for a dtest-offheap failure example and a working example, under the folder 'works', for a dtests-novnode. Searching for StreamSession in all nodes' logs gives an idea of the difference in behavior. Then following how nodes 2 and 3 come up in both node1 logs makes it clear that on the failing ones state changes are ignored 'because it is not a member in token metadata'. Seeing how nodes come up, stream, tokens recalculate, etc. is where the problem seems to be.

          Given this is present in other versions I don't know if it should be a 5.0 blocker. On the other hand, given its severity and that we don't have enough history to tell if it's a recent failure or if it has been there for ages, I have mixed thoughts. At least in Jira there is only one similar ticket, CASSANDRA-10072, and it looks unrelated. So this could be a recent thing...

          I'll keep digging a bit more.


          mck Michael Semb Wever added a comment -

          Given this is present in other versions I don't know if it should be a 5.0 blocker.

          At least it doesn't belong in fixVersion `5.0-rc`.
          Please raise priority/severity as appropriate (i.e. if it is to block all involved releases).
          bereng Berenguer Blasi added a comment - - edited

          At least it doesn't belong in fixVersion `5.0-rc`.

          That's what I haven't made up my mind on yet. I only seem to be able to find recent failures, the oldest being from Dec'23...


          samt Sam Tunnicliffe added a comment - There were quite a few dtest commits, mostly for TCM, in late November/early December. It may be worth trying some runs using a dtest sha before that, e.g. b355b84c5f7b53d390822332215e3751df562559

          brandon.williams Brandon Williams added a comment - Unfortunately since we can't reproduce, we'll have to revert the dtests to find out. I'm +1 on doing that as a temporary exploratory measure.
          bereng Berenguer Blasi added a comment - - edited

          I'm leaving this running overnight in a loop, to see at least if I can bisect even if it takes a week to do so...

          In fact, in all these years fixing tests I don't remember seeing that error until recently. So there's another data point.

          bereng Berenguer Blasi added a comment - - edited

          Nothing happened overnight: no local repro after 17h of continuous running. I raised this to critical after thinking about it overnight. That effectively blocks all releases, and it might be a bit cheeky of me given somebody else will probably be able to solve it sooner than me. So feel free to correct me please.

          Just fired another 500 repeats on circle, but on the 'offheap' variant, which seems more prone to fail, although that should be totally coincidental tbh.


          bereng Berenguer Blasi added a comment - Going over all failures, it seems this started failing on the 28th of Nov'23 on 4.0->5.0, with no failures in trunk #justfyi

          marcuse Marcus Eriksson added a comment - could try temporarily reverting 514e3d828a0593fa31c65aff731a8c1eeb87c4e1 from ccm - it affects CASSANDRA_TOKEN_PREGENERATION_DISABLED, which the bootstrap tests use.

          bereng Berenguer Blasi added a comment - ^Actually that resonates very well with my gut intuition from reading logs these days. And I could even modify the test to start nodes one by one so as not to mess with ccm. Also, it seems the offheap variant with test_shutdown_wiped_node_cannot_join does repro in CI, so we're making progress. Now we can at least try to bisect.

          bereng Berenguer Blasi added a comment -

          It seems starting the nodes one by one solves the issue. All praise and glory to marcuse as I would have gone down the dtest/C* rabbit hole and never thought of CCM.

          I have fired a new run with fixed tests + their repeats to confirm. If that passes the question will be whether this is accepted behavior?
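          (For illustration, a minimal sketch of what "starting the nodes one by one" looks like in a dtest, assuming the usual ccm node API; the actual change lives in cassandra-dtest PR #250 and may be structured differently.)

              cluster.set_environment_variable('CASSANDRA_TOKEN_PREGENERATION_DISABLED', 'True')
              cluster.populate(3)
              # Start the nodes sequentially instead of cluster.start(), so each node has
              # come up and announced its tokens before the next one starts.
              for node in cluster.nodelist():
                  node.start(wait_for_binary_proto=True, wait_other_notice=True)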

          brandon.williams Brandon Williams added a comment -

          Ah, this is very promising!

          I want to believe this is something silly yet fundamental

          CCM totally fits that bill.

          If that passes the question will be whether this is accepted behavior?

          I am not sure, can you link the change?

          bereng Berenguer Blasi added a comment -

          I am not sure, can you link the change?

          The change is in the PR: https://github.com/apache/cassandra-dtest/pull/250/files So it basically starts the nodes one by one, effectively doing the opposite of what this CCM change describes. CI is running, so this is still a theory. Should we support simultaneous node starts when CASSANDRA_TOKEN_PREGENERATION_DISABLED=true? How severe is that?

          brandon.williams Brandon Williams added a comment - I see. It's not clear why this CCM change was made and it reads a little strangely to me that we 'can_generate_tokens' if CASSANDRA_TOKEN_PREGENERATION is enabled (not disabled.) It's also not clear to me why this ends up causing massive corruption.
          marcuse Marcus Eriksson added a comment - - edited

          we were getting test timeouts with TCM without this CCM change

          The problem was:

                  node1.import_config_files()  
          

          which would add both node1 and node2 to seeds - then if we only start node1 it will wait for node2 to start (or, it times out and starts properly eventually, but not soon enough).

          it reads a little strangely to me that we 'can_generate_tokens' if CASSANDRA_TOKEN_PREGENERATION is enabled (not disabled.)

          the name of the parameter is CASSANDRA_TOKEN_PREGENERATION_DISABLED - in this case it reads ok to me?


          brandon.williams Brandon Williams added a comment -

          I see, thanks for the explanation. Given that, I think we're probably okay with starting one node at a time here, since the intention isn't to test starting multiple nodes, and that happens in plenty of other tests.

          the name of the parameter is CASSANDRA_TOKEN_PREGENERATION_DISABLED - in this case it reads ok to me?

          Well, it's 'not ...DISABLED' though, which is why I said enabled. So "if pregeneration is enabled, we can generate tokens" is how it reads to me.
          brandon.williams Brandon Williams added a comment - - edited

          I am still not sure how we get such a frightening result from this, though. It at least seems like we need some kind of safeguard to protect users from doing this.


          bereng Berenguer Blasi added a comment - CI is still running...

          So "if pregeneration is enabled, we can generate tokens" is how it reads to me.

          I guess that's only weird to me because of 'pregeneration' and 'generation' being used but meaning the same thing, but I understand how we got there.

          brandon.williams Brandon Williams added a comment - So "if pregeneration is enabled, we can generate tokens" is how it reads to me. I guess that's only weird to me because of 'pregeneration' and 'generation' being used but meaning the same thing, but I understand how we got there.

          bereng Berenguer Blasi added a comment - Seems CI is green for the repeats of all the tests that were failing. I guess we review/merge this, and the only outstanding issue is whether to open a ticket on what to do with simultaneous starts with that setup, and its severity?

          brandon.williams Brandon Williams added a comment - I would still like to understand why this blows up so badly, and how we can protect against that so that users don't encounter it.

          bereng Berenguer Blasi added a comment -

          Please check my reasoning:

          CASSANDRA_TOKEN_PREGENERATION_DISABLED is a CCM optimization for speed and not a user-facing/product feature. There's no mention of it in the C* code, only in CCM. There are already multiple code snippets and comments in CCM, i.e. here and here, which protect against this problem from the past. The token collisions mentioned there, "gives identical tokens to several nodes", are exactly what can be seen in the logs:

          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token -1513687318797282039.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token -2578028308570345435.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token -3836448899713998469.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token -5958143799716233579.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token -7015520316645191796.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token -8365722143219869680.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token -8792008644500999070.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 2183381311695650387.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 306316982182664358.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,258 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 3352336234874664766.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,258 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 4401460035203855854.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,258 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 5337373207995371274.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,258 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 5753010867567011154.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,258 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 6964691214238992299.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,258 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 7362014066587705090.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,258 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 8620434782533148178.  Ignoring /127.0.0.2:7000
          

          If we revert the CCM change we'll lose the speed gains and we won't test the different pre/post-TCM behaviours. Given that the usage of CASSANDRA_TOKEN_PREGENERATION_DISABLED is pretty isolated and constrained mainly to the bootstrap test, I propose we:

          • Merge the current PR with the version guard Marcus mentions, where trunk would not be affected so it still exercises TCM (a sketch follows this comment).
          • Given how isolated this property's usage is, anybody adding a new test should find the added comments. If that were not the case, we now have a Jira ticket that should pop up pretty quickly on any search of the failure stack trace.

          Wdyt?

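          (A hedged sketch of the kind of version guard discussed above: start nodes sequentially only on pre-TCM branches so trunk keeps exercising simultaneous starts. The cutoff version and structure here are assumptions, not the committed patch.)

              from distutils.version import LooseVersion

              cluster_version = LooseVersion(str(cluster.version()))
              if cluster_version >= LooseVersion('5.1'):
                  # Assumed TCM-era cutoff: simultaneous start stays in place on trunk
                  cluster.start(wait_for_binary_proto=True)
              else:
                  # Older branches using CASSANDRA_TOKEN_PREGENERATION_DISABLED:
                  # bring nodes up one at a time to avoid identical pregenerated tokens
                  for node in cluster.nodelist():
                      node.start(wait_for_binary_proto=True, wait_other_notice=True)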

          brandon.williams Brandon Williams added a comment -

          Ok, I see what happened here and there is no cause for concern of unknown sharp edges: the non-bootstrapping nodes were started with the same tokens (and directly injected into the ring as seeds), so they all had a different view of who actually 'won' the conflicts, and this of course wreaked havoc on consistency checks. I'm +1 on the proposed fix to this.

          More concretely, in the jenkinslogs.zip that was attached, node2 and node3 were assigned the same tokens. node1 determined that node3 won, while nodes 2 and 3 both chose themselves as the winner. Without a proper view of the ring, this explains why reading at ALL didn't resolve anything.
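          (A toy illustration of that split-brain, not Cassandra code: if each node independently keeps whichever claimant of a duplicated token it happens to process first, their ring views diverge. The tie-break rule here is an assumption chosen only to show the divergence; the token value is taken from the logs above.)

              def ring_view(gossip_order, token):
                  # Each node resolves "Nodes X and Y have the same token T" on its own;
                  # here, arbitrarily, the first claimant it sees keeps the token.
                  owners = {}
                  for claimant in gossip_order:
                      owners.setdefault(token, claimant)
                  return owners[token]

              token = -1513687318797282039
              print(ring_view(['node3', 'node2'], token))  # e.g. node1's view: node3 owns it
              print(ring_view(['node2', 'node3'], token))  # node2's view: node2 owns it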
          bereng Berenguer Blasi added a comment - - edited

          Thx for connecting the dots brandon.williams

          The version guard has been added. Trunk CI is green. 5.0 CI is green. Given how CI intensive this is I'd say we can skip 4.0 and 4.1 CI runs safely.

          I'll merge on Monday to give time for people to re-review.

          smiklosovic Stefan Miklosovic added a comment - - edited

          bereng could you drop an e-mail to the ML about the 4.1.4 staging artifact, saying that this is not a blocker? There are a lot of people in the community waiting for 4.1.4 to happen and I do not want to keep them in the dark. If this is not a blocker we can start the 4.1.4 vote.


          People

            Assignee: bereng Berenguer Blasi
            Reporter: mck Michael Semb Wever
            Authors: Berenguer Blasi
            Reviewers: Brandon Williams, Marcus Eriksson
            Votes: 0
            Watchers: 7
