We have run into an issue while transitioning from an active/active Mongo NodeStore cluster to a single Segment-Tar server with cold standby. The issue manifests when the standby server tries to pull changes from the primary after the first round of online revision GC.
Let me summarize the way we ended up with the current state, and my hypothesis about what happened, based on my debugging so far:
- We started with a Mongo NodeStore and an external FileDataStore as the blob store. The FileDataStore was set up with minRecordLength=4096. The Mongo store stores blobs below minRecordLength as special "in-memory" blobIDs where the data itself is baked into the ID string in hex.
- We have executed a sidegrade of the Mongo store into a Segment-Tar store. Our datastore is over 1TB in size, so copying the binaries wasn't an option. The new repository is simply reusing the existing datastore. The "in-memory" blobIDs still look like external blobIDs to the sidegrade process, so they were copied into the Segment-Tar repository as-is, instead of being converted into the efficient in-line format.
- The server started up without issues on the new Segment-Tar store. The migrated "in-memory" blob IDs seem to work fine, if a bit sub-optimal.
- At this point, we have created a cold standby instance by copying the files of the stopped primary instance and making the necessary config changes on both servers.
- Everything worked fine until the primary server started its first round of online revision GC. After that process completed, the standby node started throwing exceptions about missing segments, and eventually stopped altogether. In the meantime, the following warning showed up in the primary log:
This is what seems to be happening:
- The revision GC creates brand new segments, and the standby instance starts pulling them into its own store.
- When the standby sees an "in-memory" blobID, it decides that it doesn't have this blob in its own blobstore, so it proceeds to ask for the bytes of the blob from the primary, even though they are encoded in the ID itself.
- The longest blobID can be more than 8K in size (a blob of just under 4K is doubled in size by hex encoding). When such a long blobID is submitted to the primary, the request gets rejected because of its excessive length. The standby keeps waiting until the request times out, and no progress is made in syncing.
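To illustrate the length arithmetic, here is a minimal sketch. The "0x" prefix, the "cmd " request prefix and the 8192-character line limit are assumptions made for this example, not values confirmed from the code; the class and method names are hypothetical.

```java
// Sketch only: shows how hex encoding doubles the size of a small blob.
final class BlobIdLength {
    // Assumption: marker that identifies an "in-memory" blob ID.
    static final String PREFIX = "0x";

    // Encode the blob bytes into an ID string: prefix + lowercase hex.
    static String encode(byte[] data) {
        StringBuilder sb = new StringBuilder(PREFIX.length() + data.length * 2);
        sb.append(PREFIX);
        for (byte b : data) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    // True if a request line fits within a (hypothetical) maximum line length.
    static boolean fitsInLine(String requestLine, int maxLineLength) {
        return requestLine.length() <= maxLineLength;
    }

    public static void main(String[] args) {
        // Largest blob still below minRecordLength=4096.
        String id = encode(new byte[4095]);
        System.out.println(id.length()); // 8192 = 2 + 2 * 4095

        // Any request framing around the ID pushes the line past 8192.
        System.out.println(fitsInLine("cmd " + id, 8192)); // false
    }
}
```

Under these assumptions the ID alone already reaches 8192 characters, so any command text wrapped around it pushes the request over the limit.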
The issue doesn't pop up with repositories that started as Segment-Tar since Segment-Tar always inlines blobs below some hardcoded threshold (16K if I remember correctly).
I think there could be multiple ways to approach this, not mutually exclusive:
- Special-case the "in-memory" BlobIDs during sidegrade and replace them with the "native" segment values. If hardcoding knowledge about this implementation detail isn't desired, there could be a new option for the sidegrade process, to force "inlining" of blobs below a certain threshold, even if they aren't in-line in the source repo.
- Special-case the "in-memory" BlobIDs in StandbyDiff so they aren't requested from the primary, but are either kept as-is or get converted to the "native" format.
- Increase the network packet size limit in the sync protocol, or allow it to be configured. This is the least efficient option, but the one with the least impact on the code.
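The first two options both depend on recognizing an "in-memory" blob ID and recovering the bytes from the ID itself instead of asking the primary. A rough sketch of that check follows. It assumes the IDs consist of a "0x" prefix followed by hex digits; the class and method names are hypothetical and not part of the existing code.

```java
// Sketch only: detect an "in-memory" blob ID and decode it locally.
final class InlineBlobIds {
    // Assumption: marker that identifies an "in-memory" blob ID.
    static final String PREFIX = "0x";

    // True if the ID carries its own data and needs no blob store lookup.
    static boolean isInline(String blobId) {
        return blobId != null && blobId.startsWith(PREFIX);
    }

    // Recover the blob bytes from the hex payload of the ID.
    static byte[] decode(String blobId) {
        if (!isInline(blobId)) {
            throw new IllegalArgumentException("Not an in-memory blob ID");
        }
        String hex = blobId.substring(PREFIX.length());
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("Odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("Invalid hex digit");
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }
}
```

A caller on the standby side could test `isInline(blobId)` first and only fall back to requesting the blob from the primary when it returns false. The sidegrade could use the same check to decide which values to rewrite into the native in-line format.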
I can work on detailed reproduction steps if needed, but I'd rather not do so up front because the issue is cumbersome to reproduce.