Apache Cassandra
CASSANDRA-19097

Test Failure: bootstrap_test.TestBootstrap.*

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • Fix Version/s: 4.0.12, 4.1.4, 5.0-rc1, 5.0, 5.1
    • Component/s: CI
    • Labels: None

    Description

      test_killed_wiped_node_cannot_join
      test_read_from_bootstrapped_node
      test_shutdown_wiped_node_cannot_join

      Seen in dtests_offheap in CASSANDRA-19034
      https://app.circleci.com/pipelines/github/michaelsembwever/cassandra/258/workflows/cea7d697-ca33-40bb-8914-fb9fc662820a/jobs/21162/parallel-runs/38

      self = <bootstrap_test.TestBootstrap object at 0x7fc471171d50>
      
          def test_killed_wiped_node_cannot_join(self):
      >       self._wiped_node_cannot_join_test(gently=False)
      
      bootstrap_test.py:608: 
      _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
      
      self = <bootstrap_test.TestBootstrap object at 0x7fc471171d50>, gently = False
      
          def _wiped_node_cannot_join_test(self, gently):
              """
              @jira_ticket CASSANDRA-9765
              Test that if we stop a node and wipe its data then the node cannot join
              when it is not a seed. Test both a nice shutdown or a forced shutdown, via
              the gently parameter.
              """
              cluster = self.cluster
              cluster.set_environment_variable('CASSANDRA_TOKEN_PREGENERATION_DISABLED', 'True')
              cluster.populate(3)
              cluster.start()
          
              stress_table = 'keyspace1.standard1'
          
              # write some data
              node1 = cluster.nodelist()[0]
              node1.stress(['write', 'n=10K', 'no-warmup', '-rate', 'threads=8'])
          
              session = self.patient_cql_connection(node1)
              original_rows = list(session.execute("SELECT * FROM {}".format(stress_table,)))
          
              # Add a new node, bootstrap=True ensures that it is not a seed
              node4 = new_node(cluster, bootstrap=True)
              node4.start(wait_for_binary_proto=True)
          
              session = self.patient_cql_connection(node4)
      >       assert original_rows == list(session.execute("SELECT * FROM {}".format(stress_table,)))
      E       assert [Row(key=b'PP...e9\xbb'), ...] == [Row(key=b'PP...e9\xbb'), ...]
      E         At index 587 diff: Row(key=b'OP2656L630', C0=b"E02\xd2\x8clBv\tr\n\xe3\x01\xdd\xf2\x8a\x91\x7f-\x9dm'\xa5\xe7PH\xef\xc1xlO\xab+d", C1=b"\xb2\xc0j\xff\xcb'\xe3\xcc\x0b\x93?\x18@\xc4\xc7tV\xb7q\xeeF\x82\xa4\xd3\xdcFl\xd9\x87 \x9a\xde\xdc\xa3", C2=b'\xed\xf8\x8d%\xa4\xa6LPs;\x98f\xdb\xca\x913\xba{M\x8d6XW\x01\xea-\xb5<J\x1eo\xa0F\xbe', C3=b'\x9ec\xcf\xc7\xec\xa5\x85Z]\xa6\x19\xeb\xc4W\x1d%lyZj\xb9\x94I\x90\xebZ\xdba\xdd\xdc\x9e\x82\x95\x1c', C4=b'\xab\x9e\x13\x8b\xc6\x15D\x9b\xccl\xdcX\xb23\xd0\x8b\xa3\xba7\xc1c\xf7F\x1d\xf8e\xbd\x89\xcb\xd8\xd1)f\xdd') != Row(key=b'4LN78NONP0', C0=b"\xdf\x90\xb3/u\xc9/C\xcdOYG3\x070@#\xc3k\xaa$M'\x19\xfb\xab\xc0\x10]\xa6\xac\x1d\x81\xad", C1=b'\x8a\xb7j\x95\xf9\xbd?&\x11\xaaH\xcd\x87\xaa\xd2\x85\x08X\xea9\x94\xae8U\x92\xad\xb0\x1b9\xff\x87Z\xe81', C2=b'6\x1d\xa1-\xf77\xc7\xde+`\xb7\x89\xaa\xcd\xb5_\xe5\xb3\x04\xc7\xb1\x95e\x81s\t1\x8b\x16sc\x0eMm', C3=b'\xfbi\x08;\xc9\x94\x15}r\xfe\x1b\xae5\xf6v\x83\xae\xff\x82\x9b`J\xc2D\xa6k+\xf3\xd3\xff{C\xd0;', C4=b'\x8f\x87\x18\x0f\xfa\xadK"\x9e\x96\x87:tiu\xa5\x99\xe1_Ax\xa3\x12\xb4Z\xc9v\xa5\xad\xb8{\xc0\xa3\x93')
      E         Left contains 2830 more items, first extra item: Row(key=b'5N7N172K30', C0=b'Y\x81\xa6\x02\x89\xa0hyp\x00O\xe9kFp$\x86u\xea\n\x7fK\x99\xe1\xf6G\xf77\xf7\xd7\xe1\xc7L\x...0\x87a\x03\xee', C4=b'\xe8\xd8\x17\xf3\x14\x16Q\x9d\\jb\xde=\x81\xc1B\x9c;T\xb1\xa2O-\x87zF=\x04`\x04\xbd\xc9\x95\xad')
      E         Full diff:
      E           [
      …
      

      Attachments

        1. jenkinslogs.zip
          692 kB
          Berenguer Blasi

        Activity

          jlewandowski Jacek Lewandowski added a comment - Let me add to that: test_shutdown_wiped_node_cannot_join https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1119/workflows/932a44e3-c25e-4c5e-b677-66579380faeb/jobs/50782/tests
          jlewandowski Jacek Lewandowski added a comment - Those tests seem to fail not only on 5.0; I can see them failing on 4.0 as well, for example here: https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1125/workflows/a199ad8c-1a83-43c9-9218-a90c3472d113/jobs/51601 or https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1125/workflows/a199ad8c-1a83-43c9-9218-a90c3472d113/jobs/51579 . The patch which was tested there is very unlikely to be the cause.

          mck Michael Semb Wever added a comment - There were dtest changes with TCM too. Both those runs are recent.

          brandon.williams Brandon Williams added a comment - I haven't been able to reproduce these locally, and here are 500 runs on both j11 and j17.
          bereng Berenguer Blasi added a comment - Seen here https://ci-cassandra.apache.org/job/Cassandra-5.0/138/testReport/junit/dtest-upgrade.upgrade_tests.bootstrap_upgrade_test/TestBootstrapUpgrade/test_decommissioned_wiped_node_can_join/
          paulo Paulo Motta added a comment -

          Seen test_read_from_bootstrapped_node failure in 5.0-18999-j17


          bereng Berenguer Blasi added a comment - Confirmed it happens across all versions. This is a recent 4.1 one but I haven't managed to repro yet...
          bereng Berenguer Blasi added a comment - - edited

          Writes and reads are happening at LOCAL_ONE on those tests. These are 4-node tests, so switching reads to CL.ALL is the best bet so far. We'll only know after the merge as I didn't manage to repro. If we're happy with the suggested PR I'd say we merge and keep an eye on them.

          All failures, both for upgrade and non-upgrade tests (which are all based off the same base class), match the test methods I had to touch. I ran a 500-repeat CI run for one of the test methods to spare CI cycles; all other methods are fixed in the same way.

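          (For reference, a minimal sketch of what switching these reads to CL.ALL amounts to in a dtest, assuming the python-driver session the tests already use; the helper name is hypothetical and the committed PR may differ in detail.)

              from cassandra import ConsistencyLevel
              from cassandra.query import SimpleStatement

              def read_rows_at_all(session, table):
                  # Read every row back at CL.ALL instead of the driver default (LOCAL_ONE)
                  stmt = SimpleStatement("SELECT * FROM {}".format(table),
                                         consistency_level=ConsistencyLevel.ALL)
                  return list(session.execute(stmt))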

          brandon.williams Brandon Williams added a comment - We have approximately zero traction without a reproduction, so +1 to the PR, let's see how it goes.

          bereng Berenguer Blasi added a comment - Tested locally, the test passes 4.0->trunk, and committed. Thx for the review!

          brandon.williams Brandon Williams added a comment -

          Writes and reads are happening at LOCAL_ONE on those tests.

          Hmm, I think that was probably intentional to test the consistency of the newly added nodes, and reading at ALL will hide problems there. So I think this is useful to see if the issue persists since we cannot reproduce, but I don't think we should leave this change in permanently.

          bereng Berenguer Blasi added a comment - ^I see what you mean, I think you're right. We could be having range movement/streaming problems that would get hidden. I'll add a note to myself to check for failures, say, next Tuesday, as these errors happened often. Then I'll revert and on the next PR add a note explaining what you just said.
          bereng Berenguer Blasi added a comment - - edited

          I couldn't find the exact dtest SHA this run used. But if the circle times are correct, this shows a failure even after reading at ALL. So it seems we need to revert this and that we have a legitimate bootstrap problem.


          brandon.williams Brandon Williams added a comment -

          But if the circle times are correct, this shows a failure even after reading at ALL.

          If that is true, the corruption runs deeper than just bootstrap: it affects the whole cluster if ALL doesn't help. Let's stick with your plan of waiting until Tuesday and see if this persists or anything else shakes out.

          bereng Berenguer Blasi added a comment - ^Yeah, as I review things and as Jenkins runs some more SHAs, this coming up as often as it does should give us some idea of the impact of reading at ALL...

          brandon.williams Brandon Williams added a comment - Reopened to make sure we don't forget.
          bereng Berenguer Blasi added a comment - - edited

          Another one that failed with the ALL read. Confirmed, as the failure report prints the new code: https://ci-cassandra.apache.org/job/Cassandra-5.0-dtest-offheap/152/jdk=jdk_11_latest,label=cassandra-dtest,split=13/consoleFull


          brandon.williams Brandon Williams added a comment -

          That really complicates things. This means that even if we have some bug in streaming causing the corruption, we also have a second bug where the repair mechanism of ALL does not work. I want to believe this is something silly yet fundamental, but all the evidence just keeps mounting.

          I don't know that it will help (and I'm not sure what will), but maybe if we modify the test to tell us how wrong the results are, that will be useful. Currently we know something is wrong, but it's a sea of text.
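          (A possible shape for that diagnostic, as a hedged sketch: the helper below is hypothetical, not committed test code, and only summarises how far apart the two result sets are by partition key.)

              def summarize_row_diff(expected_rows, actual_rows):
                  # Compare by partition key so an assertion failure can report counts
                  # instead of dumping thousands of raw Row objects.
                  expected = {row.key for row in expected_rows}
                  actual = {row.key for row in actual_rows}
                  missing = expected - actual   # written but not returned by the new node
                  extra = actual - expected     # returned but never written
                  return ("expected={} actual={} missing={} extra={}".format(
                          len(expected), len(actual), len(missing), len(extra)))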

          bereng Berenguer Blasi added a comment - What I have seen so far, perusing the 4 nodes' logs, is that all writes went to node 1. No data reaches nodes 2 or 3. Then node 4 comes up and streams only from node 1. Then node4 is queried and it fails. The token/ring recalculation traces are impossible to follow. I need to check now against a working example set of logs; let's see if I can follow 8 log files lol. Not reverting yet #justfyi
          bereng Berenguer Blasi added a comment - - edited

          We seem to have a legit problem. I have found instances in butler from 4.0 and 4.1, as was already mentioned in the ticket. I have amended fixVersions accordingly.

          The problem seems to be a race revolving around nodes coming up, gossiper, StorageService and TokenMetadata updates, recalculations, state changes, caches, token collisions, etc. The logs are too sparse to see in detail what is going on; maybe somebody with deep knowledge of this area could pin it down, but without a reproduction this is hard.

          I have attached logs for a dtest-offheap failure example and a working example, under the folder 'works', for a dtests-novnode. Searching for StreamSession in all nodes' logs gives an idea of the difference in behavior. Then following how nodes 2 and 3 come up in both node1 logs makes it clear that on the failing ones state changes are ignored 'because it is not a member in token metadata'. Seeing how nodes come up, stream, tokens recalculate, etc. is where the problem seems to be.

          Given this is present in other versions I don't know if it should be a 5.0 blocker. On the other hand, given its severity and that we don't have enough history to tell if it's a recent failure or if it has been there for ages, I have mixed thoughts. At least in Jira there is only one similar ticket, CASSANDRA-10072, and it looks unrelated. So this could be a recent thing...

          I'll keep digging a bit more.


          mck Michael Semb Wever added a comment -

          Given this is present in other versions I don't know if it should be a 5.0 blocker.

          At least it doesn't belong in fixVersion `5.0-rc`.
          Please raise priority/severity as appropriate (i.e. if it is to block all involved releases).
          bereng Berenguer Blasi added a comment - - edited

          At least it doesn't belong in fixVersion `5.0-rc`.

          That's what I haven't made up my mind on yet. I only seem to be able to find recent failures, the oldest being from Dec'23...


          samt Sam Tunnicliffe added a comment - There were quite a few dtest commits, mostly for TCM, in late November/early December. It may be worth trying some runs using a dtest sha before that, e.g. b355b84c5f7b53d390822332215e3751df562559

          brandon.williams Brandon Williams added a comment - Unfortunately since we can't reproduce, we'll have to revert the dtests to find out. I'm +1 on doing that as a temporary exploratory measure.
          bereng Berenguer Blasi added a comment - - edited

          I'm leaving this running overnight in a loop, to see at least if I can bisect even if it takes a week to do so...

          In fact, in all these years fixing tests I don't remember seeing that error until recently. So there's another data point.

          bereng Berenguer Blasi added a comment - - edited

          Nothing happened overnight: no local repro after 17h of continuous running. I raised this to critical after thinking about it overnight. That effectively blocks all releases, and it might be a bit cheeky of me given somebody else will probably be able to solve it sooner than me. So feel free to correct me please.

          Just fired another 500 repeats on circle, but on the 'offheap' variant, which seems more prone to fail, although that should be totally coincidental tbh.


          bereng Berenguer Blasi added a comment - Going over all failures, it seems this started failing on the 28th of Nov'23 on 4.0->5.0, with no failures in trunk #justfyi

          marcuse Marcus Eriksson added a comment - could try temporarily reverting 514e3d828a0593fa31c65aff731a8c1eeb87c4e1 from ccm - it affects CASSANDRA_TOKEN_PREGENERATION_DISABLED, which the bootstrap tests use.

          bereng Berenguer Blasi added a comment - ^Actually that resonates very well with my gut intuition from reading logs these days. And I could even modify the test to start nodes one by one so as not to mess with ccm. Also, it seems the offheap variant with test_shutdown_wiped_node_cannot_join does repro in CI, so we're making progress. Now we can at least try to bisect.

          bereng Berenguer Blasi added a comment -

          It seems starting the nodes one by one solves the issue. All praise and glory to marcuse as I would have gone down the dtest/C* rabbit hole and never thought of CCM.

          I have fired a new run with fixed tests + their repeats to confirm. If that passes the question will be whether this is accepted behavior?
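          (For illustration, a minimal sketch of what "starting the nodes one by one" looks like in a dtest, assuming the usual ccm node API; the actual change lives in cassandra-dtest PR #250 and may be structured differently.)

              cluster.set_environment_variable('CASSANDRA_TOKEN_PREGENERATION_DISABLED', 'True')
              cluster.populate(3)
              # Start the nodes sequentially instead of cluster.start(), so each node has
              # come up and announced its tokens before the next one starts.
              for node in cluster.nodelist():
                  node.start(wait_for_binary_proto=True, wait_other_notice=True)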

          brandon.williams Brandon Williams added a comment -

          Ah, this is very promising!

          I want to believe this is something silly yet fundamental

          CCM totally fits that bill.

          If that passes the question will be whether this is accepted behavior?

          I am not sure, can you link the change?

          bereng Berenguer Blasi added a comment -

          I am not sure, can you link the change?

          The change is in the PR: https://github.com/apache/cassandra-dtest/pull/250/files So it basically starts the nodes one by one, effectively doing the opposite of what this CCM change describes. CI is running, so this is still a theory. Should we support simultaneous node starts when CASSANDRA_TOKEN_PREGENERATION_DISABLED=true? How severe is that?

          brandon.williams Brandon Williams added a comment - I see. It's not clear why this CCM change was made and it reads a little strangely to me that we 'can_generate_tokens' if CASSANDRA_TOKEN_PREGENERATION is enabled (not disabled.) It's also not clear to me why this ends up causing massive corruption.
          marcuse Marcus Eriksson added a comment - - edited

          we were getting test timeouts with TCM without this CCM change

          The problem was:

                  node1.import_config_files()  
          

          which would add both node1 and node2 to seeds - then if we only start node1 it will wait for node2 to start (or, it times out and starts properly eventually, but not soon enough).

          it reads a little strangely to me that we 'can_generate_tokens' if CASSANDRA_TOKEN_PREGENERATION is enabled (not disabled.)

          the name of the parameter is CASSANDRA_TOKEN_PREGENERATION_DISABLED - in this case it reads ok to me?


          brandon.williams Brandon Williams added a comment -

          I see, thanks for the explanation. Given that, I think we're probably okay with starting one node at a time here, since the intention isn't to test starting multiple nodes, and that happens in plenty of other tests.

          the name of the parameter is CASSANDRA_TOKEN_PREGENERATION_DISABLED - in this case it reads ok to me?

          Well, it's 'not ...DISABLED' though, which is why I said enabled. So "if pregeneration is enabled, we can generate tokens" is how it reads to me.
          brandon.williams Brandon Williams added a comment - - edited

          I am still not sure how we get such a frightening result from this, though. It at least seems like we need some kind of safeguard to protect users from doing this.


          bereng Berenguer Blasi added a comment - CI is still running...

          So "if pregeneration is enabled, we can generate tokens" is how it reads to me.

          I guess that's only weird to me because of 'pregeneration' and 'generation' being used but meaning the same thing, but I understand how we got there.

          brandon.williams Brandon Williams added a comment - So "if pregeneration is enabled, we can generate tokens" is how it reads to me. I guess that's only weird to me because of 'pregeneration' and 'generation' being used but meaning the same thing, but I understand how we got there.

          bereng Berenguer Blasi added a comment - Seems CI is green for the repeats of all the tests that were failing. I guess we review/merge this, and the only outstanding issue is whether to open a ticket on what to do with simultaneous starts with that setup, and its severity?

          brandon.williams Brandon Williams added a comment - I would still like to understand why this blows up so badly, and how we can protect against that so that users don't encounter it.

          bereng Berenguer Blasi added a comment -

          Please check my reasoning:

          CASSANDRA_TOKEN_PREGENERATION_DISABLED is a CCM optimization for speed and not a user-facing/product feature. There's no mention of it in the C* code, only in CCM. There are already multiple code snippets and comments in CCM, i.e. here and here, which protect against this problem from the past. The token collisions mentioned there, "gives identical tokens to several nodes", are exactly what can be seen in the logs:

          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token -1513687318797282039.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token -2578028308570345435.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token -3836448899713998469.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token -5958143799716233579.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token -7015520316645191796.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token -8365722143219869680.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token -8792008644500999070.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 2183381311695650387.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,257 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 306316982182664358.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,258 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 3352336234874664766.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,258 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 4401460035203855854.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,258 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 5337373207995371274.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,258 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 5753010867567011154.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,258 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 6964691214238992299.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,258 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 7362014066587705090.  Ignoring /127.0.0.2:7000
          INFO  [GossipStage:1] 2024-01-18 10:53:33,258 StorageService.java:3139 - Nodes /127.0.0.2:7000 and /127.0.0.3:7000 have the same token 8620434782533148178.  Ignoring /127.0.0.2:7000
          

          If we revert the CCM change we'll lose the speed gains and we won't test the different pre/post-TCM behaviours. Given that the usage of CASSANDRA_TOKEN_PREGENERATION_DISABLED is pretty isolated and constrained mainly to the bootstrap test, I propose we:

          • Merge the current PR with the version guard Marcus mentions, where trunk would not be affected so it still exercises TCM (a sketch follows this comment).
          • Given how isolated this property's usage is, anybody adding a new test should find the added comments. If that were not the case, we now have a Jira ticket that should pop up pretty quickly on any search of the failure stack trace.

          Wdyt?

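          (A hedged sketch of the kind of version guard discussed above: start nodes sequentially only on pre-TCM branches so trunk keeps exercising simultaneous starts. The cutoff version and structure here are assumptions, not the committed patch.)

              from distutils.version import LooseVersion

              cluster_version = LooseVersion(str(cluster.version()))
              if cluster_version >= LooseVersion('5.1'):
                  # Assumed TCM-era cutoff: simultaneous start stays in place on trunk
                  cluster.start(wait_for_binary_proto=True)
              else:
                  # Older branches using CASSANDRA_TOKEN_PREGENERATION_DISABLED:
                  # bring nodes up one at a time to avoid identical pregenerated tokens
                  for node in cluster.nodelist():
                      node.start(wait_for_binary_proto=True, wait_other_notice=True)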

          brandon.williams Brandon Williams added a comment -

          Ok, I see what happened here and there is no cause for concern of unknown sharp edges: the non-bootstrapping nodes were started with the same tokens (and directly injected into the ring as seeds), so they all had a different view of who actually 'won' the conflicts, and this of course wreaked havoc on consistency checks. I'm +1 on the proposed fix to this.

          More concretely, in the jenkinslogs.zip that was attached, node2 and node3 were assigned the same tokens. node1 determined that node3 won, while nodes 2 and 3 both chose themselves as the winner. Without a proper view of the ring, this explains why reading at ALL didn't resolve anything.
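          (A toy illustration of that split-brain, not Cassandra code: if each node independently keeps whichever claimant of a duplicated token it happens to process first, their ring views diverge. The tie-break rule here is an assumption chosen only to show the divergence; the token value is taken from the logs above.)

              def ring_view(gossip_order, token):
                  # Each node resolves "Nodes X and Y have the same token T" on its own;
                  # here, arbitrarily, the first claimant it sees keeps the token.
                  owners = {}
                  for claimant in gossip_order:
                      owners.setdefault(token, claimant)
                  return owners[token]

              token = -1513687318797282039
              print(ring_view(['node3', 'node2'], token))  # e.g. node1's view: node3 owns it
              print(ring_view(['node2', 'node3'], token))  # node2's view: node2 owns it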
          bereng Berenguer Blasi added a comment - - edited

          Thx for connecting the dots brandon.williams

          The version guard has been added. Trunk CI is green. 5.0 CI is green. Given how CI intensive this is I'd say we can skip 4.0 and 4.1 CI runs safely.

          I'll merge on Monday to give time for people to re-review.

          smiklosovic Stefan Miklosovic added a comment - - edited

          bereng could you drop an e-mail to the ML about the 4.1.4 staging artifact, saying that this is not a blocker? There are a lot of people in the community waiting for 4.1.4 to happen and I do not want to keep them in the dark. If this is not a blocker we can start the 4.1.4 vote.


          People

            Assignee: bereng Berenguer Blasi
            Reporter: mck Michael Semb Wever
            Authors: Berenguer Blasi
            Reviewers: Brandon Williams, Marcus Eriksson
            Votes: 0
            Watchers: 7
