[SOLR-13599] ReplicationFactorTest high failure rate on Windows jenkins VMs after 2019-06-22 OS/java upgrades - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
None

Description

We've started seeing some weirdly consistent (but not reliably reproducible) failures from ReplicationFactorTest when running on Uwe's Windows jenkins machines.

The failures all seem to have started on June 22, when Uwe upgraded his Windows VMs (to upgrade the Java version), but they happen across all versions of Java tested, and on both master and branch_8x.

While this test failed a total of 5 times, in different ways, on various jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on all but 1 or 2 of Uwe's "Windows" jenkins builds since 2019-06-22, and when it fails, the reproduceJenkinsFailures.py logic used in Uwe's jenkins builds frequently fails anywhere from 1-4 additional times.

All of these failures occur in the exact same place, with the exact same assertion: that the expected replicationFactor of 2 was not achieved, and an rf=1 (i.e., only the leader) was returned, when sending a batch of documents to a collection with 1 shard and 3 replicas, while 1 of the replicas was partitioned off due to a closed proxy.

In the handful of logs I've examined closely, the 2nd "live" replica does in fact log that it received & processed the update, but with a QTime of over 30 seconds, and then it immediately logs an org.eclipse.jetty.io.EofException: Reset cancel_stream_error Exception. Meanwhile, the leader has one updateExecutor thread logging copious amounts of java.net.ConnectException: Connection refused: no further information regarding the replica that was partitioned off, before a second updateExecutor thread ultimately logs java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: idle_timeout regarding the "live" replica.
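To make the symptom concrete, here is a minimal, self-contained sketch of how a replica that does apply an update, but acknowledges it only after the leader's idle timeout, can end up excluded from the achieved replication factor. This is not Solr's code: the class AchievedRfSketch, the Replica type, the achievedRf method, and the millisecond values (scaled down from the 30+ seconds seen in the logs) are all hypothetical names and numbers chosen for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AchievedRfSketch {

    /** Hypothetical stand-in for a non-leader replica; not a Solr class. */
    public static final class Replica {
        final String name;
        final boolean reachable; // false models the closed proxy ("Connection refused")
        final long ackDelayMs;   // how long the replica takes to acknowledge the update

        public Replica(String name, boolean reachable, long ackDelayMs) {
            this.name = name;
            this.reachable = reachable;
            this.ackDelayMs = ackDelayMs;
        }
    }

    /**
     * Counts the leader plus every replica whose acknowledgement arrives
     * within idleTimeoutMs. A replica that answers too late is not counted,
     * even though it may have applied the update.
     */
    public static int achievedRf(List<Replica> replicas, long idleTimeoutMs) {
        int rf = 1; // the leader counts itself
        for (Replica r : replicas) {
            if (!r.reachable) {
                continue; // partitioned replica: the request is refused outright
            }
            // Simulate the forwarded update; the replica "processes" it by sleeping.
            CompletableFuture<Void> ack =
                CompletableFuture.runAsync(() -> sleepQuietly(r.ackDelayMs));
            try {
                ack.get(idleTimeoutMs, TimeUnit.MILLISECONDS);
                rf++; // acknowledged in time
            } catch (TimeoutException | ExecutionException e) {
                // Too slow or failed: the leader cannot count this replica.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return rf;
            }
        }
        return rf;
    }

    private static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        Replica partitioned = new Replica("partitioned", false, 0);
        // Live replica answers quickly: leader + 1 replica.
        System.out.println(achievedRf(List.of(partitioned, new Replica("live", true, 10)), 200));
        // Live replica answers after the timeout: leader only.
        System.out.println(achievedRf(List.of(partitioned, new Replica("live", true, 600)), 200));
    }
}
```

With these assumed values, the first call corresponds to the rf=2 the test expects and the second to the rf=1 it actually observes. The sketch only models the counting; it does not explain why the live replica is slow on the Windows VMs.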

What makes this perplexing is that this is not the first time in the test that documents were added to this collection while one replica was partitioned off, but it is the first time that all 3 of the following are true at the same time:

the collection has recovered after some replicas were partitioned and re-connected
a batch of multiple documents is being added
one replica has been "re" partitioned.

...prior to the point when this failure happens, only individual document adds were tested while replicas were partitioned. Batches of adds were only tested when all 3 replicas were "live" after the proxies were re-opened and the collection had fully recovered. The failure also comes from the first update to happen after a replica's proxy port has been "closed" for the second time.

While this confluence of events might conceivably trigger some weird bug, what makes these failures particularly perplexing is that:

the failures only happen on Windows
the failures only started after the Windows VM update on June 22.

Attachments

thetaphi_Lucene-Solr-master-Windows_8025.log.txt
02/Jul/19 18:36
6.93 MB
Chris M. Hostetter

Issue Links

supercedes

SOLR-13598 ReplicationFactorTest.test failures. Expected rf=2 ... got 1

Closed

People

Assignee: Unassigned

Reporter: Chris M. Hostetter

Votes: 0

Watchers: 1

Dates

Created: 02/Jul/19 18:34

Updated: 27/Jul/19 01:30

Resolved: 27/Jul/19 01:30