It would be great indeed to be able to simplify the code as you proposed if we can rely on a bootstrap method. Below are some observations that might be useful.
One of the concerns I have relates to the default size limit of the update log. By default, it keeps a maximum of 10 tlog files or 100 records. This is likely too small to provide enough of a buffer for cdcr, and there is a risk of a continuous cycle of bootstrapping replication. One could increase the values of "numRecordsToKeep" and "maxNumLogsToKeep" in solrconfig to accommodate the cdcr requirements, but this is an additional parameter the user needs to take into consideration, and it makes configuration more complex. I am wondering if we could find a more appropriate default value for cdcr?
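For reference, raising those retention limits would look something like this in solrconfig.xml (the values here are illustrative, not recommended defaults):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <!-- keep more history so cdcr can resume without bootstrapping -->
    <int name="numRecordsToKeep">1000</int>
    <int name="maxNumLogsToKeep">100</int>
  </updateLog>
</updateHandler>
```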
The issue with increasing these limits on the original update log, compared to the cdcr update log, is that the original update log will not clean up old tlog files that are no longer necessary for replication: it keeps all tlogs up to the configured limit. For example, if one increases maxNumLogsToKeep to 100 and numRecordsToKeep to 1000, the node will always keep 100 tlog files or 1000 records in the update log, even if all of them have already been replicated to the target clusters. This might cause unexpected issues related to disk space or performance.
The CdcrUpdateLog managed this by allowing a variable-size update log that removes a tlog once it has been fully replicated. But this means we go back to where we were, with all the added management around the cdcr update log, i.e., buffer, lastprocessedversion, CdcrLogSynchronizer, ...
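A minimal sketch of that pruning behaviour, assuming the forwarder tracks the highest version acknowledged by all targets (class and method names here are illustrative, not Solr's actual CdcrUpdateLog API):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical variable-size log: a tlog segment is dropped only once every
// update in it has been confirmed replicated to all target clusters.
class ReplicatedLog {
    // Each entry is the highest version contained in one tlog segment,
    // stored oldest-first.
    private final Deque<Long> segmentMaxVersions = new ArrayDeque<>();
    private long lastProcessedVersion = -1; // highest version acked by all targets

    void addSegment(long maxVersion) {
        segmentMaxVersions.addLast(maxVersion);
    }

    // Called after the cdcr forwarder acknowledges versions up to `version`.
    void onReplicated(long version) {
        lastProcessedVersion = Math.max(lastProcessedVersion, version);
        // Drop the oldest segments whose entire contents have been replicated.
        while (!segmentMaxVersions.isEmpty()
                && segmentMaxVersions.peekFirst() <= lastProcessedVersion) {
            segmentMaxVersions.removeFirst(); // the tlog file would be deleted here
        }
    }

    int segmentCount() {
        return segmentMaxVersions.size();
    }
}
```

The point is that the log size tracks replication progress rather than a fixed limit, which is exactly the state (lastprocessedversion, buffer) that has to be managed around the cdcr update log.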
If we get rid of the cdcr update log logic, then we can also get rid of the Cdcr Buffer (buffer state, buffer commands, etc.)
I am not sure we can entirely get rid of the CdcrUpdateLog. It includes logic, such as the sub-reader and forward seek, that is necessary for sending batch updates. Maybe this logic could be moved into the UpdateLog?
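To make the forward-seek part concrete, here is a hedged sketch of what a batch forwarder needs from the log reader: the ability to resume from the first entry newer than the last version the target acknowledged. This is illustrative only, not Solr's UpdateLog or CdcrUpdateLog API:

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical log reader with forward seek: skip entries already delivered
// to the target, then iterate the remainder for the next batch.
class LogReader {
    private final Iterator<long[]> entries; // each entry: {version, payloadId}
    private long[] pending;                 // first entry found past the seek point

    LogReader(List<long[]> log) {
        this.entries = log.iterator();
    }

    // Skip everything at or below `version` (already acknowledged by the target).
    void forwardSeek(long version) {
        while (entries.hasNext()) {
            long[] e = entries.next();
            if (e[0] > version) {
                pending = e;
                return;
            }
        }
    }

    // Return the next unsent entry, or null when the log is exhausted.
    long[] next() {
        if (pending != null) {
            long[] e = pending;
            pending = null;
            return e;
        }
        return entries.hasNext() ? entries.next() : null;
    }
}
```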
I think it is safe to get rid of this. If a leader goes down while a cdcr reader is forwarding updates, the new leader will likely be missing the tlogs necessary to resume where the cdcr reader stopped. But in that case, it can fall back to bootstrapping.
If the tlogs are not replicated during a bootstrap, then the tlogs on the target will not be in sync. Could this cause any issues on the target cluster, e.g., in case of a recovery?
If the target is itself configured as a source (i.e., daisy chaining), this will probably cause issues. The update logs will likely contain gaps, and it will be very difficult for the source to know a gap exists, so it might forward incomplete updates. But this might be a feature we could drop, as suggested in one of your comments on the cwiki.