When joining just one node with very little data, we often get "Migration task failed to complete", per https://github.com/apache/cassandra/commit/ae315b5ec944571342146867c51b2ceb50f3845e
We increased the MIGRATION_TASK_WAIT_IN_SECONDS timeout to 15 minutes, thinking there was some sort of automatic retry mechanism in the underlying messaging. However, all that does is increase the time to failure. When these migration tasks fail, the bootstrap is marked complete, but it clearly was not, because using the data results in a cassandra.db.UnknownColumnFamilyException. The logs also show that no data was streamed from the seed node to the newly bootstrapping node. Numerous tests have shown that if a migration task times out, the node exits joining mode and logs the bootstrap as complete without having streamed any data, and the only course of action seems to be a Cassandra restart. Our replication factor is set such that the bootstrapping node needs to get all the data. If we were to leave the Cassandra node running, would it eventually send another migration task and stream the necessary data?
On closer inspection of the code, it seems that runMayThrow in MigrationTask.java sends the migration request message using sendRR, which is effectively fire and forget. If the callback is never hit, CountDownLatch.countDown() is never invoked, so I suppose that is the point of the timeout when waiting for the latch. But wouldn't it be better to resend the migration task? I certainly haven't learned all of the messaging service, but it seems that dropping a packet here and there could cause bootstrap to "succeed" in this misleading way. Would it make sense for MigrationTask's runMayThrow to create an IAsyncCallbackWithFailure for the callback and implement onFailure to also call CountDownLatch.countDown() and generate another migration task? Or perhaps allow users of Cassandra to configure something like MIGRATION_TASK_RETRY_ATTEMPTS?
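To make the suggestion concrete, here is a minimal standalone sketch of the retry idea. It is not Cassandra code and does not use the real MessagingService API: the `Callback` and `Transport` interfaces, `requestWithRetries`, and the `maxAttempts` parameter (standing in for the proposed MIGRATION_TASK_RETRY_ATTEMPTS) are all hypothetical names for illustration. It only shows a latch that is released on both response and failure, with a resend when the wait times out or the callback reports failure.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class MigrationRetrySketch {

    /** Hypothetical stand-in for a callback that reports failure as well as success. */
    interface Callback {
        void onResponse();
        void onFailure();
    }

    /** Hypothetical transport: may invoke one callback method, or silently drop the request. */
    interface Transport {
        void send(Callback cb);
    }

    /**
     * Sends the request up to maxAttempts times.
     * Returns the 1-based attempt that succeeded, or -1 if every attempt failed or timed out.
     */
    static int requestWithRetries(Transport transport, int maxAttempts, long waitMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            CountDownLatch done = new CountDownLatch(1);
            AtomicBoolean succeeded = new AtomicBoolean(false);

            transport.send(new Callback() {
                @Override public void onResponse() { succeeded.set(true); done.countDown(); }
                // Release the latch on failure too, so the waiter does not sit out the full timeout.
                @Override public void onFailure() { done.countDown(); }
            });

            // await returns false if the timeout elapsed with no callback at all (dropped message).
            boolean signalled = done.await(waitMillis, TimeUnit.MILLISECONDS);
            if (signalled && succeeded.get()) {
                return attempt;
            }
            // Otherwise fall through and resend instead of giving up.
        }
        return -1;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger calls = new AtomicInteger();
        // Simulated lossy link: request 1 is dropped, request 2 fails, request 3 is answered.
        Transport lossy = cb -> {
            int n = calls.incrementAndGet();
            if (n == 2) cb.onFailure();
            else if (n == 3) cb.onResponse();
        };
        int attempt = requestWithRetries(lossy, 3, 50);
        System.out.println("succeeded on attempt " + attempt);
    }
}
```

With a scheme like this, a longer timeout would bound how long each attempt waits rather than only delaying the single failure. A caller that gets -1 back could then refuse to mark the bootstrap complete.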
When the MigrationTask does fail to complete, we see the log 3 times. Is this a resend of the same migration task, which is just a schema version exchange? If so, and all 3 failed, it means every attempt either failed to reach the seed endpoint or its response failed to reach the bootstrapping endpoint. Are we correct in assuming this is a network error and that there are no scenarios where the seed node would ignore the migration task from the bootstrapping node?