Description
In one scenario the dump phase can miss some documents, so the resulting index would not include them. The scenario is as follows:
- The download process creates a checkpoint at time t0. This means that we need to download all documents that existed at time t0. Note that it's ok to download documents that were created or modified after t0, as the DocumentNodeStore.getNode(path, revision) will only consider the state as it was at the moment of the checkpoint, so at t0.
- At the start of the download, the _modified values range from 10 to 90.
- The descending download thread downloads the documents with _modified from 90 down to 80.
- At this moment, the connection to Mongo is lost and the descending download thread tries to reconnect.
- During this period, a document D with _modified=70 is updated. The update changes the _modified value to 93. Note that this document had not yet been downloaded by the descending download thread.
- The descending download thread resumes downloading from 80 down. But this creates a new cursor in Mongo after D was updated to have _modified=93. Therefore, D will not be downloaded, even though it existed at time t0 and should be included.
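
The steps above can be reproduced with a small in-memory model. This is only a sketch: it is not Oak or Mongo driver code, the helper open_descending_cursor and the document ids A to E are hypothetical, and a "cursor" is modeled simply as a sorted snapshot taken when the query is issued.

```python
# Toy model of the descending download; not Oak or Mongo driver code.

def open_descending_cursor(store, upper):
    """Model a new query: snapshot the documents with _modified <= upper,
    ordered from the highest _modified value to the lowest."""
    snapshot = [(m, doc_id) for doc_id, m in store.items() if m <= upper]
    return sorted(snapshot, reverse=True)

# doc id -> _modified, as of the checkpoint at t0
store = {"A": 90, "B": 85, "C": 80, "D": 70, "E": 10}
expected = set(store)  # every document that existed at t0

downloaded = set()
last_seen = None

# First cursor: download from 90 down to 80, then the connection is lost.
for modified, doc_id in open_descending_cursor(store, upper=90):
    if modified < 80:
        break
    downloaded.add(doc_id)
    last_seen = modified

# While the thread is reconnecting, D is updated: _modified goes from 70 to 93.
store["D"] = 93

# Second cursor: resume from the last value seen (80) downwards.
for modified, doc_id in open_descending_cursor(store, upper=last_seen):
    downloaded.add(doc_id)

missed = expected - downloaded
print(missed)  # {'D'}
```

D is skipped because its new _modified value (93) lies above the point where the second cursor starts, while the first cursor had already passed that range.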
If the connection does not fail, Mongo uses a single cursor to traverse the repository, so it will return a consistent view of the repository as it existed at the start of the download. But if the connection fails, the resumed query creates a new cursor that sees the repository as it is at that later moment, which is no longer consistent with what the first query downloaded.
Therefore, the following conditions are required for documents to be missed:
- Parallel download is enabled. If downloading only in ascending order, this problem will not occur because after a reconnection the download continues towards the highest _modified values, so all documents that were modified after the start of the download will still be downloaded. This will likely result in duplicates, but the merge sort phase of the indexing job eliminates duplicates, so it is not a problem.
- The descending download thread experiences connection failures.
- The instance is being actively updated during the download.
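
For contrast, the same toy model shows why an ascending-only download is not affected by a reconnection. Again this is a sketch under assumptions: open_ascending_cursor and the document ids are hypothetical, and the final set() call merely stands in for the duplicate elimination done by the merge sort phase.

```python
# Toy model of the ascending download; not Oak or Mongo driver code.

def open_ascending_cursor(store, lower):
    """Model a new query: snapshot the documents with _modified >= lower,
    ordered from the lowest _modified value to the highest."""
    return sorted((m, doc_id) for doc_id, m in store.items() if m >= lower)

# doc id -> _modified, as of the checkpoint at t0
store = {"A": 90, "B": 85, "C": 80, "D": 70, "E": 10}
expected = set(store)

downloaded = []  # a list, so that duplicates stay visible
last_seen = None

# First cursor: download from 10 up to 70, then the connection is lost.
for modified, doc_id in open_ascending_cursor(store, lower=10):
    if modified > 70:
        break
    downloaded.append(doc_id)
    last_seen = modified

# While the thread is reconnecting, D is updated: _modified goes from 70 to 93.
store["D"] = 93

# Second cursor: resume from the last value seen (70) upwards.
for modified, doc_id in open_ascending_cursor(store, lower=last_seen):
    downloaded.append(doc_id)

print(downloaded)                   # ['E', 'D', 'C', 'B', 'A', 'D']
print(set(downloaded) == expected)  # True
```

D is downloaded twice, once at 70 and once at 93, but no document is missed, because an update can only move a document ahead of the ascending cursor, never behind it.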