For the Kudu Java client we have the TestLeaderFailover test, which verifies how the client handles a tablet server fail-over. However, the test covers only one fail-over event and mainly performs write operations while the backend handles the 'unexpected crash' of the tablet server.
It would be nice to add more tests which cover the client's fail-over behavior:
- Add a mixed workload scenario, i.e. combine inserts and scans during the fail-over. Running the scans would verify not only that the data eventually reaches the destination, but also that the client automatically retries the scan operations and eventually succeeds in reading the data from the cluster.
- Induce more fail-over events while running the scenario, i.e. pause and then resume the tserver processes many more times and run the test for longer. This is to spot possible bugs during the transitions and when multiple fail-over events occur in a row.
- In the mixed workload scenarios, run scan operations in READ_AT_SNAPSHOT mode with different selectors: LEADER_ONLY and CLOSEST_REPLICA. That's to cover the retry code paths for both cases (as of now, I could see only the LEADER_ONLY path covered, but I might be mistaken).
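The scenarios listed above could be structured roughly as follows. This is only a sketch against an in-memory fake, not against the Kudu Java client: `Client`, `FailoverInjector`, `FakeCluster` and `run` are hypothetical names made up for illustration, and the `Selector` enum merely mirrors the two selector names mentioned above. The assumption is that a paused tserver makes the next two write attempts and the next two scan attempts fail, and that the client retries them internally, so the scenario itself never retries.

```java
import java.util.ArrayList;
import java.util.List;

class MixedWorkloadSketch {
    /** Mirrors the two selector names from the ticket; not the Kudu enum itself. */
    enum Selector { LEADER_ONLY, CLOSEST_REPLICA }

    /** Hypothetical client facade; a real test would wrap the Kudu Java client here. */
    interface Client {
        void insert(int key);
        int countRows(Selector selector);
    }

    /** Hypothetical hook for pausing and resuming a tablet server process. */
    interface FailoverInjector {
        void pause();
        void resume();
    }

    /** In-memory fake standing in for both the client and the cluster. */
    static class FakeCluster implements Client, FailoverInjector {
        private final List<Integer> rows = new ArrayList<>();
        private int writeFailures = 0;  // write attempts that fail before a new leader appears
        private int scanFailures = 0;   // scan attempts that fail before a new leader appears
        int retriesObserved = 0;

        public void pause() { writeFailures = 2; scanFailures = 2; }
        public void resume() { writeFailures = 0; scanFailures = 0; }

        // Models the client's internal retry: burn failed attempts, then succeed.
        public void insert(int key) {
            while (writeFailures > 0) { writeFailures--; retriesObserved++; }
            rows.add(key);
        }

        public int countRows(Selector selector) {
            while (scanFailures > 0) { scanFailures--; retriesObserved++; }
            return rows.size();
        }
    }

    /** Runs `rounds` fail-over events; returns the number of scans that saw every row. */
    static int run(Client client, FailoverInjector injector, int rounds, int rowsPerRound) {
        int verifiedScans = 0;
        int written = 0;
        for (int round = 0; round < rounds; round++) {
            injector.pause();
            for (int i = 0; i < rowsPerRound; i++) {
                client.insert(written++);
            }
            // Scan with each selector while the fail-over is still in progress.
            for (Selector selector : Selector.values()) {
                if (client.countRows(selector) != written) {
                    throw new IllegalStateException(
                        "round " + round + ", " + selector + ": missing rows");
                }
                verifiedScans++;
            }
            injector.resume();
        }
        return verifiedScans;
    }

    public static void main(String[] args) {
        FakeCluster cluster = new FakeCluster();
        System.out.println(run(cluster, cluster, 5, 10) + " scans verified");
    }
}
```

In a real test the fake would be replaced by the actual client and a mini cluster, and the number of rounds would be raised so that the run spans many fail-over events.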
The general idea is to make sure that, during fail-over events, the Java client:
- Retries write and read operations automatically when an error occurs due to a fail-over event.
- Does not silently lose any data: if the client cannot send the data because of a timeout or because it runs out of retry attempts, it should report the failure.
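The two expectations above can be expressed as a simple accounting check: every write is either stored or explicitly reported as failed. The sketch below illustrates that check with a bounded retry loop over a fake write path. It does not show how the Kudu client implements retries; `withRetries`, `RetriesExhaustedException`, `FlakyStore` and `writeAll` are hypothetical names, and treating every exception as a transient fail-over error is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

class RetryAccountingSketch {
    /** Raised when retries run out, so the caller always learns about the failure. */
    static class RetriesExhaustedException extends Exception {
        RetriesExhaustedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Runs op up to maxAttempts times; returns its result or throws once attempts run out. */
    static <T> T withRetries(Callable<T> op, int maxAttempts) throws RetriesExhaustedException {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e; // assumed to be a transient fail-over error in this sketch
            }
        }
        throw new RetriesExhaustedException("gave up after " + maxAttempts + " attempts", last);
    }

    /** Fake write path: each key fails a configurable number of times before it is stored. */
    static class FlakyStore {
        final List<Integer> stored = new ArrayList<>();
        private final int[] failuresLeft;

        FlakyStore(int[] failuresPerKey) {
            this.failuresLeft = failuresPerKey.clone();
        }

        boolean write(int key) throws Exception {
            if (failuresLeft[key] > 0) {
                failuresLeft[key]--;
                throw new Exception("no leader for key " + key);
            }
            stored.add(key);
            return true;
        }
    }

    /** Writes keys 0..n-1 and returns the keys that were reported as failed. */
    static List<Integer> writeAll(FlakyStore store, int n, int maxAttempts) {
        List<Integer> reportedFailures = new ArrayList<>();
        for (int key = 0; key < n; key++) {
            final int k = key;
            try {
                withRetries(() -> store.write(k), maxAttempts);
            } catch (RetriesExhaustedException e) {
                reportedFailures.add(k); // the failure is surfaced, never dropped
            }
        }
        return reportedFailures;
    }

    public static void main(String[] args) {
        FlakyStore store = new FlakyStore(new int[] {0, 2, 5, 1});
        List<Integer> failed = writeAll(store, 4, 3);
        System.out.println("stored=" + store.stored + " reportedFailed=" + failed);
    }
}
```

A test built on this idea would assert that the stored keys plus the reported failures add up to the number of attempted writes, and that a scan after the fail-over returns exactly the stored keys.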