I found this while digging into the failure on https://builds.apache.org/job/Lucene-Solr-Tests-trunk-Java8/69/
The following sequence of events leads to a deadlock:
- testasynccollectioncreation_shard1_0_replica2 (core_node5) becomes active
- OCP asks sub-shard leader testasynccollectioncreation_shard1_0_replica1 to wait until replica2 is in recovery; replica2 is already active by then, so the wait never completes
At this point, the test just keeps polling for status until it times out and fails.
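
The ordering problem above can be reproduced in isolation. The following is only a minimal sketch of the race, not the actual Solr code: `WaitForStateSketch`, `Replica`, `ReplicaState` and `waitForState` are hypothetical names made up for illustration, and the 200 ms timeout is arbitrary. It assumes the waiter checks for an exact state match, so a replica that has already moved past recovery is never seen in that state.

```java
import java.util.concurrent.atomic.AtomicReference;

public class WaitForStateSketch {
    // Hypothetical replica states, loosely modelled on the ones named in this report.
    enum ReplicaState { DOWN, RECOVERING, ACTIVE }

    // Hypothetical stand-in for the replica state published in the cluster state.
    static final class Replica {
        private final AtomicReference<ReplicaState> state;
        Replica(ReplicaState initial) { state = new AtomicReference<>(initial); }
        ReplicaState get() { return state.get(); }
        void set(ReplicaState s) { state.set(s); }
    }

    // Polls until the replica reports exactly `expected`, or the timeout elapses.
    // Returns true if the state was observed, false on timeout or interruption.
    static boolean waitForState(Replica replica, ReplicaState expected, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (deadline - System.nanoTime() > 0) {
            if (replica.get() == expected) {
                return true;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Replica replica2 = new Replica(ReplicaState.RECOVERING);

        // Step 1: the replica finishes recovery and becomes active first.
        replica2.set(ReplicaState.ACTIVE);

        // Step 2: only now does the waiter start looking for RECOVERING.
        // The state has already passed, so the wait can only end by timeout.
        boolean seen = waitForState(replica2, ReplicaState.RECOVERING, 200);
        System.out.println("saw RECOVERING: " + seen); // prints "saw RECOVERING: false"
    }
}
```

In this sketch the wait would succeed if it also accepted ACTIVE, or if it were issued before the replica could become active. Which of those is appropriate for the real code is a separate question.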