Bootstrap of a new node to expand an existing cluster is completing the bootstrapping "successfully", joining the cluster as UN in nodetool status, but has no data and no active streams. Writes and reads start being served.
Environment: AWS, IPV6, three datacenters, asia / europe / us
We have previously scaled the europe and us datacenters from 5 nodes to 25 nodes (one node at a time) without incident.
Asia (tokyo) is a different story. We have seen multiple failure scenarios, but the most troubling is a string of attempted node bootstrappings where the bootstrap completes and the node joins the ring, but there is no data.
We were able to expand Asia by four nodes by increasing ring delay to 100 seconds, but that has not worked anymore.
Attached Log: Our autoscaling + Ansible initial setup creates the node, but the ansible has not run yet, so the autostarted cassandra fails to load, but it has no security group yet so it did not communicate with any other node. That is the 15:15:XX series log messages at the very top.
Then 15:20:XX series messages begin after ansible has completed setup of the box, and the data dirs and commit log dirs have been scrubbed.
This same process ran for EU and US expansions without incident.
From what I can tell of the log (DEBUG was enabled):
Ring information collection begins, so some sort of gossip/cluster communication is healthy:
INFO [main] 2019-06-12 15:20:05,468 StorageService.java:1142 - JOINING: waiting for ring information
But almost all of those checks output:
DEBUG [GossipStage:1] 2019-06-12 15:20:07,673 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
Which seems weird, as we shall see soon. After all the nodes have reported in a similar way, most not pulling because of the above message, and a couple that were interpreted as DOWN, it then does:
INFO [main] 2019-06-12 15:21:45,486 StorageService.java:1142 - JOINING: schema complete, ready to bootstrap
INFO [main] 2019-06-12 15:21:45,487 StorageService.java:1142 - JOINING: waiting for pending range calculation
INFO [main] 2019-06-12 15:21:45,487 StorageService.java:1142 - JOINING: calculation complete, ready to bootstrap
We then get a huge number of
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=dd5d7fa0-1e50-11e9-a62d-0d41c97b2404
INFO [main] 2019-06-12 15:23:25,515 StorageService.java:1142 - JOINING: Starting to bootstrap...
INFO [main] 2019-06-12 15:23:25,525 StreamResultFuture.java:87 - Stream #05af9ee0-8d26-11e9-85c1-bd5476090c54 Executing streaming plan for Bootstrap
INFO [main] 2019-06-12 15:23:25,526 StorageService.java:1199 - Bootstrap completed! for the tokens [-7314981925085449175, ...<tokens>... 5499447097629838103]
Here are the log messages for MIgrationManager for schema gossiping:
DEBUG [main] 2019-06-12 15:20:05,423 MigrationManager.java:493 - Gossiping my schema version 59adb24e-f3cd-3e02-97f0-5b395827453f
DEBUG [MigrationStage:1] 2019-06-12 15:23:25,694 MigrationManager.java:493 - Gossiping my schema version 3d1a9d9e-1120-37ae-abe0-e064cd147a99
DEBUG [MigrationStage:1] 2019-06-12 15:23:25,775 MigrationManager.java:493 - Gossiping my schema version 0bf74f5a-4b39-3525-b217-e9ccf7a1b6cb
DEBUG [MigrationStage:1] 2019-06-12 15:23:25,905 MigrationManager.java:493 - Gossiping my schema version b145475a-02dc-370c-8af7-a9aba2d61362
DEBUG [InternalResponseStage:12] 2019-06-12 15:24:26,445 MigrationManager.java:493 - Gossiping my schema version 9c2ed14a-8db5-39b3-af48-6cdb5463c772
the schema version ending in -6cdb5463c772 is the proper version in the other nodes per gossipinfo. But as can be seen, the bootstrap completion message (15:23:25,526) is logged before four or five intermediate schema versions are created, which seem to be due to system_distributed and other keyspaces being created.
The Bootstrap completed! message comes from a Future in StorageService line 1199 for cassandra 2.2.13 source:
That future seems to complete a bit too quickly. Perhaps the schema did not synchronize quickly enough and therefore a list of migration/bootstrap tasks not created in time for the bootstrapper future to handle. We suspect the AWS asia datacenter networking/WAN links are very slow, or IPV6 is introducing other delays.
But the code should not allow a bootstrap to silently/falsely complete "successfully".
Smells like a race condition.