I have a cloud deployment of Solr 4.5 on EC2. The architecture is 3 dedicated ZK nodes and a pair of Solr nodes. I'll apologize in advance that this error report is not going to have a lot of detail; I'm really hoping that the scenario/description will trigger some "likely" possible explanation.
The situation I got into was that the server had decided to fail over, so my app servers were all talking to what should have been the primary for most of the shards/collections, but was actually the replica.
Here's where it gets odd: no errors were being returned to the client code for any of the searches or document updates, and the current primary server was definitely receiving all of the updates, even though they were being submitted to the inactive/replica node. (Clients were talking to solr-p1, which was not primary at the time, and writes were being passed through to solr-r1, which was primary at the time.)
All sounds good so far, right? Except that the replica server at the time (solr-p1), through which the writes were passing, never got any of those content updates. It had an old, unmodified copy of the index.
I restarted solr-p1 (the replica at the time), with no change in behavior. Behavior did not change until I killed and restarted the current primary (solr-r1) to force it to fail over.
At that point, everything was all happy again and working properly.
Until this morning, when one of the developers provisioned a new collection, which happened to put its primary on solr-r1. Again, the clients were all pointing at solr-p1. The developer reported that the documents were going into the index, but were not visible on the replica server.
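
For anyone wanting to confirm this kind of divergence, here is a minimal sketch of a per-core document count comparison. It queries each core with distrib=false so that only the addressed core answers, rather than the request being fanned out across the collection. The host names solr-p1 and solr-r1 are the ones from this report; the port (8983), the /solr base path, and the core name in the usage note are assumptions and will differ per install. Note that a count mismatch can also just mean the newest documents have not been committed yet, so this is a rough check, not proof of a replication fault.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def core_query_url(base_url, core, query="*:*"):
    """Build a select URL that asks only the addressed core (distrib=false)."""
    params = urlencode({"q": query, "rows": 0, "wt": "json", "distrib": "false"})
    return f"{base_url.rstrip('/')}/{core}/select?{params}"


def fetch_num_found(url):
    """Fetch the URL and return response.numFound from Solr's JSON reply."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)["response"]["numFound"]


def compare_replicas(cores, fetch=fetch_num_found):
    """Return ({label: count}, in_sync) for a dict of label -> (base_url, core).

    `fetch` is injectable so the comparison can be exercised without a live Solr.
    """
    counts = {
        label: fetch(core_query_url(base, core))
        for label, (base, core) in cores.items()
    }
    return counts, len(set(counts.values())) <= 1
```

Usage would look something like compare_replicas({"p1": ("http://solr-p1:8983/solr", "mycore"), "r1": ("http://solr-r1:8983/solr", "mycore")}), where "mycore" is a placeholder for whatever core name the Solr admin UI shows on each node.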