Description
I have a 1_shard / m_replicas SolrCloud cluster with Solr 6.6.0 and run batches of 5 - 10k in-place updates from time to time.
Once, I noticed that a job "hung": it started and couldn't finish for a while.
Logs were full of messages like:
Missing update, on which current in-place update depends on, hasn't arrived. id=__, looking for version=___, last found version=0
Tried to fetch document ___ from the leader, but the leader says document has been deleted. Deleting the document here and skipping this update: Last found version: 0, was looking for: ___
Further analysis shows that:
- Among the updates there are 100-500 updates for non-existent documents (something that I have to deal with).
- The leader receives a batch of updates and executes them one by one. JavabinLoader, which is used to process the documents, reuses the same AddUpdateCommand instance for every update and just clears its state at the end. The field AddUpdateCommand#prevVersion is not cleared.
- If an update is an in-place update but the specified document does not exist, the update is processed as a regular atomic update (i.e. a new doc is created), yet prevVersion is still sent as the distrib.inplace.prevversion parameter in the sequential calls to every replica in DistributedUpdateProcessor. Because prevVersion wasn't cleared, it may contain the version from a previously processed update.
- Each replica checks its own version of the document, which is 0 (because the doc does not exist), concludes that some updates were missed, and spends 5 seconds in DistributedUpdateProcessor#waitForDependentUpdates waiting for them (no luck). It then also tries to get the "correct" version from the leader (no luck as well).
- So each update for a non-existent document costs m * 5 sec.
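To make the chain above concrete, here is a minimal, self-contained sketch of the stale-state effect. It does not use Solr: `ReusableCommand`, `forwardedPrevVersions` and the sample ids and versions are hypothetical stand-ins, and the sketch only assumes that -1 means "no dependency on an earlier update" and that the command is reused across a batch as described.

```java
import java.util.ArrayList;
import java.util.List;

public class StalePrevVersionDemo {

    /** Hypothetical stand-in for a reused update command; NOT Solr's AddUpdateCommand. */
    static class ReusableCommand {
        String id;
        long version = 0;
        long prevVersion = -1; // assumption: -1 means "depends on no earlier update"

        void clear(boolean resetPrevVersion) {
            id = null;
            version = 0;
            if (resetPrevVersion) {
                prevVersion = -1;
            }
        }
    }

    /**
     * Runs a small batch through one shared command and returns, per update,
     * the prevVersion that would be forwarded to the replicas.
     */
    static List<Long> forwardedPrevVersions(boolean resetPrevVersion) {
        String[] ids = {"doc1", "missing", "doc2"};
        long[] versionOnLeader = {100L, 0L, 200L}; // 0 = document does not exist

        ReusableCommand cmd = new ReusableCommand();
        List<Long> forwarded = new ArrayList<>();
        for (int i = 0; i < ids.length; i++) {
            cmd.id = ids[i];
            if (versionOnLeader[i] > 0) {
                // In-place update of an existing doc: depends on its current version.
                cmd.prevVersion = versionOnLeader[i];
            }
            // Missing doc: handled as a regular update, prevVersion is left untouched.
            forwarded.add(cmd.prevVersion);
            cmd.clear(resetPrevVersion);
        }
        return forwarded;
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + forwardedPrevVersions(false)); // [100, 100, 200]
        System.out.println("with reset:    " + forwardedPrevVersions(true));  // [100, -1, 200]
    }
}
```

Without the reset, the update for the missing document carries the previous update's version (100), which is what makes a replica wait for an update that will never arrive. As a purely illustrative calculation, 300 such updates with m = 3 would cost about 300 * 3 * 5 s = 4500 s, or 75 minutes.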
I worked around this with an explicit check that the doc exists, but it should probably be fixed.
The obvious first guess is that prevVersion should be cleared in AddUpdateCommand#clear, but I have no clue how to test it.
+++ solr/core/src/java/org/apache/solr/update/AddUpdateCommand.java (revision )
@@ -78,6 +78,7 @@
     updateTerm = null;
     isLastDocInBatch = false;
     version = 0;
+    prevVersion = -1;
   }
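On the question of how to test this, one possible approach is a unit-level check that compares a used-then-cleared command against a fresh one, field by field, via reflection. The sketch below is only an illustration under stated assumptions: `Command` is a hypothetical stand-in (not Solr's AddUpdateCommand), its `clear()` deliberately omits the prevVersion reset to mimic the report, and the helper names are made up. A real class may have fields that are intentionally not reset, and those would need to be excluded.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class ClearResetsAllFieldsCheck {

    /** Hypothetical stand-in for a reusable command; NOT Solr's AddUpdateCommand. */
    static class Command {
        String id;
        boolean isLastDocInBatch = false;
        long version = 0;
        long prevVersion = -1;

        void clear() {
            id = null;
            isLastDocInBatch = false;
            version = 0;
            // prevVersion = -1;  // deliberately omitted to mimic the reported bug
        }
    }

    /** Returns the names of instance fields whose values differ between the two objects. */
    static List<String> fieldsNotReset(Object fresh, Object cleared) {
        List<String> stale = new ArrayList<>();
        for (Field f : fresh.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) {
                continue; // only compare real per-instance state
            }
            f.setAccessible(true);
            try {
                if (!Objects.equals(f.get(fresh), f.get(cleared))) {
                    stale.add(f.getName());
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return stale;
    }

    /** Dirties every field, calls clear(), and reports what clear() forgot. */
    static List<String> staleFieldsAfterClear() {
        Command used = new Command();
        used.id = "doc1";
        used.isLastDocInBatch = true;
        used.version = 42;
        used.prevVersion = 41;
        used.clear();
        return fieldsNotReset(new Command(), used);
    }

    public static void main(String[] args) {
        System.out.println(staleFieldsAfterClear()); // [prevVersion]
    }
}
```

A test built this way fails for any field that clear() forgets, not just prevVersion, so it would also guard against the same kind of leak if new fields are added later.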
Issue Links
- is related to: SOLR-5944 Support updates of numeric DocValues (Closed)
- relates to: SOLR-11475 Endless loop and OOM in PeerSync (Open)