Affects Version/s: None
Fix Version/s: None
I've noticed a pattern of failure behavior in Jenkins runs of AutoAddReplicasIntegrationTest (which mostly manifests in the subclass HdfsAutoAddReplicasIntegrationTest, probably due to timing) which indicates either:
- the test is too contrived, and expects autoAddReplicas to kick in in a situation that the current impl of NodeLostTrigger isn't smart enough to handle
- NodeLostTrigger should be smart enough to handle this, but isn't.
The test failure is currently somewhat finicky to reproduce, and depends on a node being stopped, restarted, and stopped again, while an affected collection is changed from autoAddReplicas=false to autoAddReplicas=true before the second "stop".
Regardless of which of the 2 above is true: the test itself is somewhat convoluted. It creates a sequence of events (some randomized, some static) and asserts specific outcomes after each, but the timing of scheduled triggers like NodeLostTrigger, and the interplay of things like "pick a random node to shut down" with a subsequent "explicitly shut down node2" (even if it was the node randomly shut down earlier), is confusing.
I'm creating this issue to track two tightly dependent objectives:
- refactoring this test to:
  - better isolate the specific things it's trying to test in individual test methods.
  - have a singular test method that triggers the specific sequence of events that is currently problematic (ideally in such a way that it reliably fails).
- AwaitsFix this new test method until someone with a better understanding of the autoAddReplicas / NodeLostTrigger code can assess if the test is faulty or the code being tested is faulty.
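
To make the problematic ordering concrete, here is a minimal sketch of the sequence (stop, restart, flip autoAddReplicas, stop again) as a fixed, non-randomized list of steps. It is a toy model only: `ProblemSequenceSketch`, `Step`, `ToyState` and `replay` are hypothetical names invented for illustration, not Solr classes. Placing the autoAddReplicas change between the restart and the second stop is an assumption, since the description above only says it happens before the second "stop". The sketch makes no claim about what NodeLostTrigger does in response.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: none of these types are Solr classes. */
final class ProblemSequenceSketch {

    /** The kinds of events in the sequence described in this issue. */
    enum Step { STOP_NODE, START_NODE, ENABLE_AUTO_ADD_REPLICAS }

    /** Hypothetical stand-in for the bits of cluster state the sequence touches. */
    static final class ToyState {
        boolean nodeLive = true;          // the affected node starts out running
        boolean autoAddReplicas = false;  // the collection starts with autoAddReplicas=false
        final List<String> log = new ArrayList<>();

        /** Applies one step and records the resulting state. */
        void apply(Step step) {
            switch (step) {
                case STOP_NODE:
                    nodeLive = false;
                    break;
                case START_NODE:
                    nodeLive = true;
                    break;
                case ENABLE_AUTO_ADD_REPLICAS:
                    autoAddReplicas = true;
                    break;
            }
            log.add(step + " -> nodeLive=" + nodeLive
                    + ", autoAddReplicas=" + autoAddReplicas);
        }
    }

    /**
     * Fixed order of events, with no randomization.
     * Assumption: the flag is flipped after the restart; the issue only
     * states that it happens before the second stop.
     */
    static List<Step> problematicSequence() {
        return List.of(
                Step.STOP_NODE,
                Step.START_NODE,
                Step.ENABLE_AUTO_ADD_REPLICAS,
                Step.STOP_NODE);
    }

    /** Replays the steps against a fresh toy state and returns the final state. */
    static ToyState replay(List<Step> steps) {
        ToyState state = new ToyState();
        for (Step step : steps) {
            state.apply(step);
        }
        return state;
    }

    public static void main(String[] args) {
        ToyState end = replay(problematicSequence());
        end.log.forEach(System.out::println);
    }
}
```

In an isolated test method, each step would be driven against the real test cluster, with an explicit wait and assertion after it, so that a failure points at one specific transition rather than at the interplay of randomized and static steps.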