This took some time to track down.
- The Overseer collection processor now calls commit after all replicas are created and before it switches shard states
- ZkController is modified so that only sub shard leaders skip log recovery on startup. Sub shard replicas will recover from the leader before they publish themselves as active
- DistributedUpdateProcessor is changed to detect when the current node is a sub shard leader that should forward updates to its replicas
- ShardSplitTests now also checks shard consistency before calling a global commit and testing for final correctness
- Fixed SolrCmdDistributor.syncRequest to log the correct exception
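
To illustrate the DistributedUpdateProcessor change, here is a minimal sketch of the forwarding decision. It is a toy model, not the code in the patch: `SubShardForwardingSketch`, `Node`, `SliceState`, `shouldForward` and `applyUpdate` are hypothetical names, and the exact condition (sub shard leader, shard still under construction, update received from the parent shard leader) is my assumption about what the check needs to cover.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative model only; none of these types are Solr classes. */
public class SubShardForwardingSketch {

    /** Simplified states: a sub shard is under construction until the split completes. */
    public enum SliceState { CONSTRUCTION, ACTIVE }

    /** Hypothetical node that records the document ids it has received. */
    public static final class Node {
        public final String name;
        public final List<String> docs = new ArrayList<>();
        public Node(String name) { this.name = name; }
    }

    /**
     * Decide whether a node must forward an incoming update to its own replicas.
     * Assumption: this is needed only when the node leads a sub shard that is still
     * under construction and the update arrived from the parent shard leader.
     * Regular leader-to-replica distribution for active shards is not modelled here.
     */
    public static boolean shouldForward(boolean isSubShardLeader, SliceState state,
                                        boolean fromParentLeader) {
        return isSubShardLeader && state == SliceState.CONSTRUCTION && fromParentLeader;
    }

    /** Apply the update on the sub shard leader, then forward it if the rule above holds. */
    public static void applyUpdate(Node leader, List<Node> replicas, SliceState state,
                                   boolean fromParentLeader, String docId) {
        leader.docs.add(docId);
        if (shouldForward(true, state, fromParentLeader)) {
            for (Node replica : replicas) {
                replica.docs.add(docId);
            }
        }
    }

    public static void main(String[] args) {
        Node leader = new Node("shard1_0_leader");
        Node replica = new Node("shard1_0_replica");
        applyUpdate(leader, Arrays.asList(replica), SliceState.CONSTRUCTION, true, "doc1");
        System.out.println(replica.docs); // prints [doc1]
    }
}
```

Without the forwarding step, the replica in this model would never see `doc1`, which is the same symptom as the lost updates between replica creation and shard activation.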
All tests pass.
This patch fixes three related bugs:
- We don't commit on sub shards, so not all docs are visible on them
- Sub shard replicas skip recovering from the leader, so they show no docs upon creation
- Sub shard replicas can lose updates from the leader between the time they are created and the time the shard becomes active
The first two have easy fixes. The last one required invasive changes.
I'll put up this patch in case someone wants to review and commit it tomorrow morning my time (IST).
I'd like to include this in 4.4 since it fixes major bugs, but I'm not sure we have enough time to let it bake. Perhaps if we cut the RC on Monday instead of Friday, we can let Jenkins test it for a while? Review comments are welcome.