I was analyzing another "shards-out-of-sync" failure on trunk.
It looks like certain updates are just not being forwarded from the leader to a particular replica.
Working theory: the HttpClient's max-connections-per-host limit is being hit, starving updates on certain update threads.
This could account for why shutdownNow on the update executor service has such an impact. In an orderly shutdown, all scheduled jobs still run (I think), which means connections get released and the updates that were being starved get to proceed. But it's for exactly this reason that we should probably keep the shutdownNow... it mimics much better what happens in real world situations when a node goes down.
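To make that difference concrete, here's a minimal standalone sketch using plain java.util.concurrent (this is not Solr's actual update executor wiring, just the shutdown semantics): shutdown() lets everything already submitted run to completion, while shutdownNow() interrupts in-flight tasks and discards the queue.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService updates = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 10; i++) {
            final int id = i;
            updates.submit(() -> {
                try {
                    Thread.sleep(500); // stand-in for an update request holding a connection
                    System.out.println("update " + id + " completed");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    System.out.println("update " + id + " interrupted");
                }
            });
        }

        // Orderly shutdown would let all 10 queued updates finish,
        // releasing whatever connections they hold:
        // updates.shutdown();

        // shutdownNow interrupts running tasks and drops the queue --
        // much closer to what a node crash looks like to the cluster.
        List<Runnable> neverRan = updates.shutdownNow();
        System.out.println(neverRan.size() + " queued updates were dropped");

        updates.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```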
From the log line below, it looks like maxConnectionsPerHost is 20:
13404 INFO (TEST-HdfsChaosMonkeyNothingIsSafeTest.test-seed#[A22375CC545D2B82]) [ ] o.a.s.h.c.HttpShardHandlerFactory created with socketTimeout : 90000,urlScheme : ,connTimeout : 15000,maxConnectionsPerHost : 20,maxConnections : 10000,corePoolSize : 0,maximumPoolSize : 2147483647,maxThreadIdleTime : 5,sizeOfQueue : -1,fairnessPolicy : false,useRetries : false,
edit: oops, the above is for search, not updates. The default for updates looks like it's 100, so it's harder to hit. Although with a mix of streaming and non-streaming requests, and connections not being reused immediately, it's perhaps still possible. Still digging along this line of logic.
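For reference, a per-host cap like the ones above is what Apache HttpClient's pooling connection manager enforces. A rough sketch of the mechanics under suspicion (illustrative only, not Solr's actual wiring; the URL and limits are placeholders): once the per-route limit's worth of connections are leased out and not yet returned to the pool (e.g. a streaming response still open, or an entity never fully consumed), the next request blocks waiting for a lease, which would look exactly like a starved update thread.

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

public class PerHostCapDemo {
    public static void main(String[] args) throws Exception {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(10000);        // cf. maxConnections in the log above
        cm.setDefaultMaxPerRoute(20); // cf. maxConnectionsPerHost

        try (CloseableHttpClient client = HttpClients.custom()
                .setConnectionManager(cm)
                .build()) {
            // If 20 requests to the same host:port are in flight and their
            // connections haven't been released back to the pool, the 21st
            // execute() blocks here waiting for a lease.
            try (CloseableHttpResponse rsp =
                    client.execute(new HttpGet("http://127.0.0.1:8983/solr/admin/ping"))) {
                // Fully consuming the entity is what returns the connection
                // to the pool; a response left half-read keeps it leased.
                EntityUtils.consume(rsp.getEntity());
            }
        }
    }
}
```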
The test used 12 nodes (and 2 shards)... increasing the chance of hitting the max connections (since all nodes run on the same host).