Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Description
CDCR uses tlogs as a queueing mechanism. If the connection between DCs goes down for any reason and the outage goes unnoticed, the tlogs will grow forever, which can lead to disk-full situations and all that entails.
Aside from that problem, it's not clear that reprocessing a zillion updates is faster than a full replication anyway.
Since full-index replication was added, we can avoid runaway tlogs by somehow noticing that we haven't been connected to the remote DC for a long time, purging the tlogs (keeping just enough for peer sync, of course) and doing a full index replication the next time we do connect.
This is pretty vague. I don't have a good idea of whether the right metric is tlog size, time since the last successful transmission, queue size, or some combination of these and others. The point is simply that once some threshold is crossed, we reset to a zero state and avoid the pitfalls of continuing to accumulate updates.
I'd suggest these be tunable parameters defined in solrconfig.xml since I can imagine that terabyte-scale indexes should fall back to full-index replication more rarely than megabyte-scale indexes.
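As a rough illustration only, the threshold check could look something like the sketch below. Every name here (`CdcrFallbackPolicy`, `shouldFallBack`, and the three limits) is hypothetical and not part of any existing Solr or CDCR API. It assumes the caller can already measure tlog size, the time of the last successful transmission, and the queue depth, and that the limits would be read from solrconfig.xml.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical sketch, not an existing Solr class: decides when to stop
 * accumulating tlogs and fall back to full-index replication.
 */
final class CdcrFallbackPolicy {
    private final long maxTlogBytes;       // assumed tunable: total tlog size on disk
    private final Duration maxDisconnect;  // assumed tunable: time since last successful send
    private final long maxQueuedUpdates;   // assumed tunable: updates waiting in the queue

    CdcrFallbackPolicy(long maxTlogBytes, Duration maxDisconnect, long maxQueuedUpdates) {
        this.maxTlogBytes = maxTlogBytes;
        this.maxDisconnect = maxDisconnect;
        this.maxQueuedUpdates = maxQueuedUpdates;
    }

    /** Returns true as soon as any one of the three limits is exceeded. */
    boolean shouldFallBack(long tlogBytes, Instant lastSuccess, Instant now, long queuedUpdates) {
        boolean tooBig = tlogBytes > maxTlogBytes;
        boolean tooLong = Duration.between(lastSuccess, now).compareTo(maxDisconnect) > 0;
        boolean tooMany = queuedUpdates > maxQueuedUpdates;
        return tooBig || tooLong || tooMany;
    }
}
```

Combining the metrics with a simple OR is just one option. When the check returns true, the caller would purge the tlogs (keeping enough for peer sync) and flag the collection for full-index replication on the next successful connection.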
This idea came up in discussions and I wanted to preserve it in case someone wants to pursue it.