I have a 1_shard / m_replicas SolrCloud cluster with Solr 6.6.0 and run batches of 5 - 10k in-place updates from time to time.
Once I noticed that a job "hangs": it started and couldn't finish for a while.
Logs were full of messages like:
Further analysis shows that:
- Among the updates there are 100-500 updates for nonexistent documents (something that I have to deal with).
- The leader receives a batch of updates and executes them one by one. JavabinLoader, which is used to process the documents, reuses the same AddUpdateCommand instance for every update and just clears its state at the end. The field AddUpdateCommand#prevVersion is not cleared.
- If an update is an in-place update but the specified document does not exist, the update is processed as a regular atomic update (i.e. a new doc is created), but prevVersion is still sent as the distrib.inplace.prevversion parameter in the sequential calls to every slave in DistributedUpdateProcessor. Since prevVersion wasn't cleared, it may contain the version from a previously processed update.
- Each slave checks its own version of the document, which is 0 (because the doc does not exist), concludes that some updates were missed, and spends 5 seconds in DistributedUpdateProcessor#waitForDependentUpdates waiting for the missed updates (no luck). It also tries to get the "correct" version from the leader (no luck as well).
- So each update for a nonexistent document costs m * 5 sec.
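To make the mechanism above concrete, here is a minimal self-contained sketch of the reuse pattern. It is a simplified model, not Solr code: ReusableCommand and StalePrevVersionDemo are hypothetical names, the -1 "no previous version" default is an assumption, and the model assumes nothing sets prevVersion when the target document is missing.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a reusable update command; NOT Solr's AddUpdateCommand.
class ReusableCommand {
    String docId;
    boolean inPlace;
    long prevVersion = -1; // assumption: -1 means "no previous version"

    // Mirrors the reported problem: prevVersion is left untouched.
    void clear() {
        docId = null;
        inPlace = false;
    }
}

class StalePrevVersionDemo {
    // Returns the prevVersion that would be forwarded to replicas for each update.
    static List<Long> forwardedPrevVersions() {
        ReusableCommand cmd = new ReusableCommand();
        List<Long> forwarded = new ArrayList<>();

        // Update 1: in-place update of an existing document at version 100.
        cmd.docId = "existing-doc";
        cmd.inPlace = true;
        cmd.prevVersion = 100L;
        forwarded.add(cmd.prevVersion);
        cmd.clear();

        // Update 2: the document is missing, so nothing sets prevVersion here
        // and the value from update 1 leaks through.
        cmd.docId = "missing-doc";
        forwarded.add(cmd.prevVersion);
        cmd.clear();

        return forwarded;
    }

    public static void main(String[] args) {
        System.out.println(forwardedPrevVersions()); // prints [100, 100]
    }
}
```

The second entry should have been the "no previous version" default; instead the replica is told to expect version 100 for a document it has never seen.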
I worked around this by explicitly checking that the doc exists, but it probably should be fixed.
The obvious first guess is that prevVersion should be cleared in AddUpdateCommand#clear, but I have no clue how to test it.
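One possible way to test this kind of fix in isolation, without a cluster, is to compare a cleared command against a freshly constructed one field by field. The sketch below does that with reflection on a hypothetical FixedCommand class (not Solr's AddUpdateCommand; the field set and the -1 default are assumptions). Whether every field of the real class is meant to be reset by clear() is something I have not verified, so a real test might need an allow-list.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical command with the proposed fix applied; NOT Solr's class.
class FixedCommand {
    String docId;
    boolean inPlace;
    long prevVersion = -1; // assumption: -1 means "no previous version"

    void clear() {
        docId = null;
        inPlace = false;
        prevVersion = -1; // the proposed fix
    }
}

class ClearResetCheck {
    // Calls clear() on the given command and returns the names of instance
    // fields whose values still differ from those of a fresh instance.
    static List<String> fieldsNotReset(FixedCommand dirty) {
        FixedCommand fresh = new FixedCommand();
        dirty.clear();
        List<String> leaked = new ArrayList<>();
        for (Field f : FixedCommand.class.getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) {
                continue; // only instance state matters for reuse
            }
            f.setAccessible(true);
            try {
                if (!Objects.equals(f.get(fresh), f.get(dirty))) {
                    leaked.add(f.getName());
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return leaked;
    }

    public static void main(String[] args) {
        FixedCommand cmd = new FixedCommand();
        cmd.docId = "some-doc";
        cmd.inPlace = true;
        cmd.prevVersion = 100L;
        System.out.println(fieldsNotReset(cmd)); // prints []
    }
}
```

If the prevVersion line is removed from clear(), the returned list contains "prevVersion", which is the failure the test is meant to catch.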