[SOLR-13811] possible autoAddReplicas bug and/or (Hdfs)AutoAddReplicasIntegrationTest refactoring / fixes - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

I've noticed a pattern of failures in Jenkins runs of AutoAddReplicasIntegrationTest (mostly manifesting in the subclass HdfsAutoAddReplicasIntegrationTest, probably due to timing) which indicates that either:

- the test is too contrived, and expects autoAddReplicas to kick in in a situation that the current impl of NodeLostTrigger isn't smart enough to handle, or
- NodeLostTrigger should be smart enough to handle this situation, but isn't.

The test failure is currently somewhat finicky to reproduce. It depends on a node being stopped, restarted, and stopped again, with an affected collection being changed from autoAddReplicas=false to autoAddReplicas=true before the second "stop".
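To make that sequence concrete, here is a minimal toy model of it in Python. This is not Solr code and says nothing about how NodeLostTrigger actually behaves: the `Event` type, the event names, and both helper functions are hypothetical, and the model assumes autoAddReplicas starts out false. It only encodes the ordering described above and checks whether an event log contains it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; not a Solr type."""
    kind: str                      # "stop", "start", or "set_auto_add"
    node: Optional[str] = None     # node name, for stop/start events
    value: Optional[bool] = None   # new autoAddReplicas value, for set_auto_add


def problematic_sequence(node: str) -> List[Event]:
    """The ordering described in the issue, for a single node."""
    return [
        Event("set_auto_add", value=False),
        Event("stop", node=node),
        Event("start", node=node),
        Event("set_auto_add", value=True),
        Event("stop", node=node),
    ]


def matches_problem_pattern(events: List[Event], node: str) -> bool:
    """True if `node` is stopped while autoAddReplicas is false, restarted,
    and then stopped again while autoAddReplicas is true."""
    auto_add = False                # assumption: the collection starts disabled
    running = True
    stopped_while_disabled = False
    restarted = False
    for ev in events:
        if ev.kind == "set_auto_add":
            auto_add = bool(ev.value)
        elif ev.node != node:
            continue                # events for other nodes are ignored
        elif ev.kind == "stop" and running:
            running = False
            if restarted and auto_add:
                return True
            if not auto_add:
                stopped_while_disabled = True
        elif ev.kind == "start" and not running:
            running = True
            restarted = stopped_while_disabled
    return False
```

A checker like this could be used to confirm that a randomized run actually produced the ordering of interest before its outcome is counted as a reproduction.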

Regardless of which of the two possibilities above is true, the test itself is somewhat convoluted. It creates a sequence of events (some randomized, some static) and asserts specific outcomes after each, but the timing of scheduled triggers like NodeLostTrigger, and the interplay of things like "pick a random node to shutdown" with a subsequent "explicitly shut down node2" (even if it was the node randomly shut down earlier), is confusing.

I'm creating this issue to track two tightly dependent objectives:

- refactoring this test to:
  - better isolate the specific things it's trying to test in individual test methods.
  - have a single test method that triggers the specific sequence of events that is currently problematic (ideally in such a way that it reliably fails).
- AwaitsFix this new test method until someone with a better understanding of the autoAddReplicas / NodeLostTrigger code can assess whether the test is faulty or the code being tested is faulty.

Attachments


- apache_Lucene-Solr-NightlyTests-8.x_221.log.txt (4.51 MB, 02/Oct/19 17:28, Chris M. Hostetter)
- hoss_local_failure_after_refactoring.log.txt (3.11 MB, 02/Oct/19 17:28, Chris M. Hostetter)

People

Assignee: Unassigned

Reporter: Chris M. Hostetter

Votes: 0

Watchers: 1

Dates

Created: 02/Oct/19 16:45

Updated: 13/Aug/21 18:43

Resolved: 13/Aug/21 18:43