Thank you, Mike Drob, I did not know about ThrowingRunnable.
We might possibly want to hide this new functionality behind a version check? Does the patch apply relatively easily to 6.5 as well?
The patch relies on some changes from SOLR-5944, which AFAIK will be backported too; however, I can create a 6.x patch as well.
Can you help me understand the full scope of the problem here - child docs are only in danger of spurious delete until the next commit point, right?
A reordered DBQ can happen on the replicas in two ways: an update with an earlier version arrives after a DBQ with a later version, or vice versa. Solr handles the two cases as follows:
- If a DBQ arrives with a lower version than the latest updates, the DBQ gets an additional version filter to protect documents that were added earlier with a higher version.
- However, if the DBQ is not by ID (or something similarly limiting) but, for example, a range or match-all query, it will still delete child docs that were added with a higher-versioned parent doc. This is what the jira was originally about, and testLogReplayWithReorderedDBQByAsterixAndChildDocs tests this case.
- If an update arrives with a lower version than the latest DBQs, DirectUpdateHandler2 takes an add-and-delete path, where the earlier-arrived DBQs with higher versions are replayed after the update.
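To illustrate the first two bullets, here is a minimal toy model in plain Java, not Solr code. It assumes the version filter only protects documents that carry their own version at or above the DBQ's version, and that child docs carry no version at all. The class `ReorderedDbqSketch` and all of its members are hypothetical names made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class ReorderedDbqSketch {
    // Toy document: parents carry a version, child docs do not (null).
    static final class Doc {
        final String id;
        final String root;
        final Long version;
        Doc(String id, String root, Long version) {
            this.id = id;
            this.root = root;
            this.version = version;
        }
    }

    // Apply a late-arriving DBQ: delete everything the query matches, except
    // docs whose own version is at or above the DBQ's version (assumption).
    static List<Doc> applyReorderedDbq(List<Doc> index, Predicate<Doc> query, long dbqVersion) {
        List<Doc> kept = new ArrayList<>();
        for (Doc d : index) {
            boolean protectedByVersion = d.version != null && d.version >= dbqVersion;
            if (!query.test(d) || protectedByVersion) {
                kept.add(d);
            }
        }
        return kept;
    }

    // A parent (version 200) with one child, then a match-all DBQ with version 100.
    static List<String> survivingIds() {
        List<Doc> index = new ArrayList<>();
        index.add(new Doc("p1", "p1", 200L));
        index.add(new Doc("c1", "p1", null)); // child doc, no version
        List<String> ids = new ArrayList<>();
        for (Doc d : applyReorderedDbq(index, d -> true, 100L)) {
            ids.add(d.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(survivingIds()); // prints [p1]
    }
}
```

In this model the parent survives because of its higher version, while the versionless child is deleted by the match-all query, which is the situation the second bullet describes.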
Now, doNormalUpdate(cmd) checked whether the document is a block document (has children) and did two things differently based on that:
- It calls updateDocuments (plural), which accepts an Iterable and inserts every child document.
- It builds idTerm from the _root_ field instead of the id field, so before adding the document, Lucene deletes the parent AND the child documents as well.
On the other hand, addAndDelete() did not differentiate block docs at all, so the child docs were ignored during both inserts and overwrites.
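The difference between the two paths can be sketched with a toy in-memory index. This is only an illustration under simplifying assumptions, not the DirectUpdateHandler2 code: `BlockUpdateSketch`, `blockAwareUpdate`, and `idOnlyUpdate` are hypothetical names, and a "document" here is just an id plus a root value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockUpdateSketch {
    // Toy document: every doc in a block shares the parent's id as its root.
    static final class Doc {
        final String id;
        final String root;
        Doc(String id, String root) {
            this.id = id;
            this.root = root;
        }
    }

    // Block-aware path: delete by root, then add the parent and every child.
    static void blockAwareUpdate(List<Doc> index, String parentId, List<String> childIds) {
        index.removeIf(d -> d.root.equals(parentId));
        index.add(new Doc(parentId, parentId));
        for (String childId : childIds) {
            index.add(new Doc(childId, parentId));
        }
    }

    // Non-block-aware path: delete by id only, then add only the parent.
    // The children passed in are ignored, and old children stay in the index.
    static void idOnlyUpdate(List<Doc> index, String parentId, List<String> childIds) {
        index.removeIf(d -> d.id.equals(parentId));
        index.add(new Doc(parentId, parentId));
    }

    // Index a parent with two children, then overwrite it with one child.
    static int countAfter(boolean blockAware) {
        List<Doc> index = new ArrayList<>();
        blockAwareUpdate(index, "p1", Arrays.asList("c1", "c2"));
        if (blockAware) {
            blockAwareUpdate(index, "p1", Arrays.asList("c3"));
        } else {
            idOnlyUpdate(index, "p1", Arrays.asList("c3"));
        }
        return index.size();
    }

    public static void main(String[] args) {
        System.out.println(countAfter(true));  // prints 2 (p1, c3)
        System.out.println(countAfter(false)); // prints 3 (c1, c2, p1)
    }
}
```

The two paths end with different document counts for the same update, which mirrors the numDocs mismatch between replicas when only some of them saw the DBQ reordered.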
So any reordered DBQ caused one of the following:
- Child docs were lost when a new document was inserted (testLogReplayWithReorderedDBQInsertingChildnodes).
- Existing child docs were left untouched on update. This caused numDocs inconsistency between replicas when the update contained a different number of child docs (testLogReplayWithReorderedDBQUpdateWithDifferentChildCount).
In other words, replication of child docs was effectively dropped whenever a DBQ was reordered.
So if they make it to disk, even though they don't have versions, they are still safe from disappearing in the future.
AFAIK, the reordering cannot happen on the leader, so this does not affect the leader, only the replicas. I assume any PeerSync would fail due to the fingerprint check, and the replica would eventually replicate the correct index. Yonik Seeley, could you please verify my assumption?