  1. Solr
  2. SOLR-6273 Cross Data Center Replication
  3. SOLR-6465

CDCR: fall back to whole-index replication when tlogs are insufficient

    Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      When the peer-shard doesn't have transaction logs to forward all the needed updates to bring a peer up to date, we need to fall back to normal replication.

      1. 1858.log.gz
        384 kB
        Steve Rowe
      2. 1890.log.gz
        417 kB
        Steve Rowe
      3. SOLR-6465.patch
        80 kB
        Shalin Shekhar Mangar
      4. SOLR-6465.patch
        80 kB
        Shalin Shekhar Mangar
      5. SOLR-6465.patch
        64 kB
        Shalin Shekhar Mangar
      6. SOLR-6465.patch
        63 kB
        Shalin Shekhar Mangar

        Issue Links

          Activity

          shalinmangar Shalin Shekhar Mangar added a comment - - edited

          This is the first cut for this feature.

          The CdcrRequestHandler supports a new asynchronous command called BOOTSTRAP which triggers a full index replication from a given master URL. There is a corresponding BOOTSTRAP_STATUS command which reports whether a bootstrap operation is running, finished successfully, or failed.

          The "shardcheckpoint" command has been modified to return the max version across the index and update log using the same updateVersionToHighest logic used to initialize version buckets from tlog+index during startup/reload.

          The CdcrReplicatorManager calls collectioncheckpoint to read the max version indexed on the target; if it then finds that there is a gap in its tlog, it asks the target cluster to bootstrap itself from the current shard leader on the source. While the bootstrap runs, a flag is set in CdcrReplicatorState so that the CdcrReplicatorScheduler does not send any updates to the target cluster. Once the bootstrap is complete, collectioncheckpoint is called again and the returned version is used to open a regular tlog reader, from which the normal CDCR replication mechanism takes over.
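          The gap-detection and fallback decision described above can be sketched roughly as follows. All class and method names here are hypothetical, for illustration only; the real logic lives in CdcrReplicatorManager and friends:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only, not Solr's actual API.
public class BootstrapFallbackSketch {
    // Set while a bootstrap is in flight so the scheduler skips this target.
    final AtomicBoolean bootstrapInProgress = new AtomicBoolean(false);

    /** Decide whether normal tlog forwarding can resume, or a full-index bootstrap is needed. */
    public String decide(long targetMaxVersion, long oldestVersionInSourceTlog) {
        // If the target is behind the oldest update still held in the source tlog,
        // the intervening updates are gone: fall back to whole-index replication.
        if (targetMaxVersion < oldestVersionInSourceTlog) {
            bootstrapInProgress.set(true);   // scheduler stops sending updates
            return "BOOTSTRAP";
        }
        return "FORWARD_FROM_TLOG";          // open a regular tlog reader at targetMaxVersion
    }

    public static void main(String[] args) {
        BootstrapFallbackSketch s = new BootstrapFallbackSketch();
        System.out.println(s.decide(100, 250)); // target missed versions in the gap
        System.out.println(s.decide(300, 250)); // no gap: tlog forwarding resumes
    }
}
```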

          A new test called CdcrBootstrapTest is added for this feature. There is some additional code in CdcrUpdateLog which allows one to convert an existing cluster with data into a CDCR source.

          There are plenty of nocommits and debug logging in this patch which I will work to resolve/remove in the next patches. I also found a few bugs for which I'll open separate issues.

          Open items/todos:

          1. Now that we can bootstrap target clusters using the index files, we have no need to keep update logs around for a long time. Therefore, we can get rid of CdcrUpdateLog itself and make CDCR work with regular UpdateLog.
          2. In the same vein, there is no need to replicate tlog files from leader to replicas on the source cluster, so "lastprocessedversion", CdcrLogSynchronizer and the tlog replication code can be purged.
          3. This patch currently stops regular CDCR updates from being sent to target leaders during bootstrap, but that is not necessary as we can buffer them and apply them after bootstrap completes.
          4. Hardening is required against the bootstrap process racing with recovery. Normally this won't happen because bootstrap only happens on target shard leaders but if/when the leadership changes, I suspect bootstrap can continue to run for a while and race with core recovery. I haven't been able to trigger this yet in a test case but I'll continue to work on it.
          5. In this patch, the bootstrap trigger thread is initiated on state change, but if it exits due to an unhandled condition then the replication state is stuck in bootstrapping mode forever and there is no corrective action except disabling and re-enabling CDCR replication. Although care has been taken to handle most failures, after implementing this I feel that it is unnecessarily fragile and we are better off adding some logic to the scheduled replicator component than trying to do the bootstrap only once on init.
          6. The existing CDCR tests which test aspects related to tlog replication do not pass currently. Once we yank that code, this would be a non-issue.
          7. Tests and more tests!
          noble.paul Noble Paul added a comment - - edited

          why the double lock in CdcrRequestHandler.handleBootstrapAction()

          boolean locked = bootstrapLock.tryLock();
          try {
            if (!locked) {
              throw new SolrException(SolrException.ErrorCode.INVALID_STATE, "A bootstrap action is already running");
            }
            bootstrapLock.lock();

          isn't just one enough?

          shalinmangar Shalin Shekhar Mangar added a comment -

          Changes:

          1. Refactored bootstrap status runnable to an inner class called BootstrapStatusRunnable
          2. BootstrapStatusRunnable is closed when either CDCR is disabled via API or when the current core is no longer the leader. It will cancel waiting for bootstrap if CDCR is stopped or if the current core is no longer the leader.
          3. Removed the unused BootstrapService from CdcrRequestHandler
          4. Added a CANCEL_BOOTSTRAP action in CdcrRequestHandler which will make a best effort to cancel a running bootstrap operation.
          5. If CDCR is disabled on the source cluster or if the leader loses leadership, then a cancel bootstrap message is sent to the target cluster.

          why the double lock in...

          This has been removed to only use tryLock.
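          For illustration, the tryLock-only guard can be sketched as below (hypothetical names, not the actual CdcrRequestHandler code). Note that with a ReentrantLock, the original extra lock() call would succeed reentrantly, leaving the hold count at two, so a single unlock would never actually release the lock:

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only, not the actual handler code.
public class BootstrapGuard {
    private final ReentrantLock bootstrapLock = new ReentrantLock();

    /** Returns true if this call won the right to run the bootstrap. */
    public boolean runBootstrap(Runnable bootstrap) {
        if (!bootstrapLock.tryLock()) {
            // Another bootstrap is already running; reject instead of blocking.
            return false;
        }
        try {
            bootstrap.run();
            return true;
        } finally {
            bootstrapLock.unlock();  // tryLock acquired exactly once, so one unlock suffices
        }
    }
}
```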

          shalinmangar Shalin Shekhar Mangar added a comment -

          I'm going to start a branch because the current patch (and code for related issues) is becoming difficult to maintain.

          dpgove Dennis Gove added a comment -

          Does this apply to v5.5 or is it 6.x only?

          dpgove Dennis Gove added a comment -

          I was able to get this to apply to 5.5 by first applying https://issues.apache.org/jira/secure/attachment/12775961/SOLR-6273-plus-8263-5x.patch which is attached to SOLR-6273. I did need to make some minor tweaks to that patch as it didn't apply perfectly cleanly and left a couple of minor compilation errors (visibility of certain variables to extension classes).

          This patch also required a few tweaks as it didn't apply cleanly; moreover, it appears the patch contains Java lambdas, which do not work in Java 7 or below.

          common.compile-core:
              [javac] Compiling 25 source files to /Users/dgove1/dev/bfs/lucene-solr/solr/build/solr-core/classes/java
              [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.7
              [javac] /Users/dgove1/dev/bfs/lucene-solr/solr/core/src/java/org/apache/solr/handler/CdcrRequestHandler.java:623: error: lambda expressions are not supported in -source 1.7
              [javac]     Runnable thread = () -> {
              [javac]                          ^
              [javac]   (use -source 8 or higher to enable lambda expressions)
              [javac] 1 error
              [javac] 1 warning
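          For reference, a lambda like the one flagged above can be rewritten as an anonymous inner class to compile under -source 1.7. A minimal sketch with illustrative names, not the actual CdcrRequestHandler code:

```java
// Illustrative sketch: the same Runnable in Java 8 lambda form and Java 7 form.
public class LambdaBackport {
    static Runnable lambdaForm() {
        return () -> System.out.println("bootstrap");      // requires -source 8 or higher
    }

    static Runnable java7Form() {
        return new Runnable() {                            // compiles under -source 1.7
            @Override
            public void run() {
                System.out.println("bootstrap");
            }
        };
    }

    public static void main(String[] args) {
        java7Form().run();
    }
}
```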
          
          shalinmangar Shalin Shekhar Mangar added a comment -

          Hi Dennis Gove, this patch applies on master (maybe a bit out of date as master changes very quickly). I'll push a branch (off master) containing this patch shortly. It uses lambdas (not required of course) because it is my understanding that there may not be a new 5.6 feature release. But if such a release happens, I can change the code and merge back when required.

          erickerickson Erick Erickson added a comment -

          Dennis:

          If it's easy, could you add your tweaks to the SOLR-6273-plus-8263-5x.patch on 6273? When I put that patch up, it was in case we ever decided to back-port CDCR to 5x or if people (such as yourself) wanted to be brave. Since the 5x branch is getting more and more static, your new version would be helpful for quite some time if anyone else wants to go down that route.

          Or are the tweaks small enough that anyone brave enough to apply that patch would also find tweaking it easy?

          dpgove Dennis Gove added a comment -

          The tweaks were very small but they do change visibility of some things.

          1. in CdcrTransactionLog, close() is changed from protected to public.
          2. in TransactionLog, log and debug are changed from private to protected
          3. in UpdateLog, log and debug are changed from private to protected

          The full diff of changes is below. If we're comfortable with the visibility changes, I'd be happy to add them to SOLR-6273.

          diff --git a/solr/core/src/java/org/apache/solr/update/CdcrTransactionLog.java b/solr/core/src/java/org/apache/solr/update/CdcrTransactionLog.java
          index f800f6f..e706733 100644
          --- a/solr/core/src/java/org/apache/solr/update/CdcrTransactionLog.java
          +++ b/solr/core/src/java/org/apache/solr/update/CdcrTransactionLog.java
          @@ -193,7 +193,7 @@ public class CdcrTransactionLog extends TransactionLog {
             }
          
             @Override
          -  protected void close() {
          +  public void close() {
               try {
                 if (debug) {
                   log.debug("Closing tlog" + this);
          diff --git a/solr/core/src/java/org/apache/solr/update/TransactionLog.java b/solr/core/src/java/org/apache/solr/update/TransactionLog.java
          index c8b8332..35020be 100644
          --- a/solr/core/src/java/org/apache/solr/update/TransactionLog.java
          +++ b/solr/core/src/java/org/apache/solr/update/TransactionLog.java
          @@ -63,8 +63,8 @@ import org.slf4j.LoggerFactory;
            *
            */
           public class TransactionLog implements Closeable {
          -  private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
          -  private static boolean debug = log.isDebugEnabled();
          +  protected static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
          +  protected static boolean debug = log.isDebugEnabled();
             private static boolean trace = log.isTraceEnabled();
          
             public final static String END_MESSAGE="SOLR_TLOG_END";
          diff --git a/solr/core/src/java/org/apache/solr/update/UpdateLog.java b/solr/core/src/java/org/apache/solr/update/UpdateLog.java
          index c5dc9a4..ad05d5f 100644
          --- a/solr/core/src/java/org/apache/solr/update/UpdateLog.java
          +++ b/solr/core/src/java/org/apache/solr/update/UpdateLog.java
          @@ -75,8 +75,8 @@ public class UpdateLog implements PluginInfoInitialized {
             public static String LOG_FILENAME_PATTERN = "%s.%019d";
             public static String TLOG_NAME="tlog";
          
          -  private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
          -  private static boolean debug = log.isDebugEnabled();
          +  protected static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
          +  protected static boolean debug = log.isDebugEnabled();
             private static boolean trace = log.isTraceEnabled();
          
             // TODO: hack
          
          erickerickson Erick Erickson added a comment -

          As far as I'm concerned, just add the diff file to 6273 with a note. Anybody adventurous enough to apply CDCR to 5x will be well enough served just knowing somebody else's experience.

          rendel Renaud Delbru added a comment -

          Shalin Shekhar Mangar, would the goal be to rely solely on the bootstrapping method to replicate indexes, instead of using the updates forwarding method (i.e., cdcr update logs) ? Or would it be a combination of bootstrapping and updates forwarding (based on the original update log, not the cdcr one) ?

          shalinmangar Shalin Shekhar Mangar added a comment -

          The goal is to rely on both based on the original update log.

          rendel Renaud Delbru added a comment -

          It would be great indeed to be able to simplify the code as you proposed if we can rely on a bootstrap method. Below are some observations that might be useful.

          One of the concerns I have is related to the default size limit of the update logs. By default, it keeps 10 tlog files or 100 records. This will likely be too small to provide enough buffer for CDCR, and there might be a risk of a continuous cycle of bootstrapping replication. One could increase the values of "numRecordsToKeep" and "maxNumLogsToKeep" in solrconfig to accommodate the CDCR requirements. But these are additional parameters that the user needs to take into consideration, and they make configuration more complex. I am wondering if we could find a more appropriate default value for CDCR?

          The issue with increasing limits in the original update log, compared to the cdcr update log, is that the original update log will not clean up old tlog files that are no longer necessary for replication (it will keep all tlogs up to that limit). For example, if one increases maxNumLogsToKeep to 100 and numRecordsToKeep to 1000, then the node will always keep 100 tlog files or 1000 records in the update logs, even if all of them have been replicated to the target clusters. This might cause unexpected issues related to disk space or performance.
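          For concreteness, the retention limits mentioned above are configured on the updateLog in solrconfig.xml. A sketch using the example values from this comment (illustrative values, not recommended defaults):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <!-- Keep more history so CDCR can catch up from the tlog instead of
         falling back to bootstrap. The trade-off: the plain UpdateLog keeps
         up to this many logs/records even after they have been replicated. -->
    <int name="numRecordsToKeep">1000</int>
    <int name="maxNumLogsToKeep">100</int>
  </updateLog>
</updateHandler>
```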

          The CdcrUpdateLog was managing this by allowing a variable size update log that removes a tlog when it has been fully replicated. But then this means we go back to where we were with all the added management around the cdcr update log, i.e., buffer, lastprocessedversion, CdcrLogSynchronizer, ...

          Cdcr Buffer

          If we get rid of the cdcr update log logic, then we can also get rid of the Cdcr Buffer (buffer state, buffer commands, etc.)

          CdcrUpdateLog

          I am not sure we can get rid of the CdcrUpdateLog entirely. It includes logic such as the sub-reader and forward seek that are necessary for sending batch updates. Maybe this logic can be moved into the UpdateLog?

          CdcrLogSynchronizer

          I think it is safe to get rid of this. In the case where a leader goes down while a cdcr reader is forwarding updates, the new leader will likely miss the tlogs necessary to resume where the cdcr reader stopped. But in this case, it can fall back to bootstrapping.

          Tlog Replication

          If the tlogs are not replicated during a bootstrap, then the tlogs on the target will not be in sync. Could this cause any issues on the target cluster, e.g., in case of a recovery?
          If the target is itself configured as a source (i.e., daisy chain), this will probably cause issues. The update logs will likely contain gaps, and it will be very difficult for the source to know that there is a gap. Therefore, it might forward incomplete updates. But this might be a feature we could drop, as suggested in one of your comments on the cwiki.

          shalinmangar Shalin Shekhar Mangar added a comment -

          Major changes:

          1. Hardened the bootstrap and bootstrap status request code paths. The bootstrap is still done only once during init but I wrote chaos monkey style tests to exercise this path. Also see SOLR-9364
          2. tlog replication can be disabled via a parameter, which is used by target clusters during bootstrap. This prevents tlogs from source leaders from being replicated to target leaders.
          3. Assert that we are the leader before starting bootstrap process
          4. Bootstrap uses the same recovery lock to avoid racing with recovery and potentially corrupting the index
          5. CdcrReplicatorState is initialized eagerly rather than waiting for bootstrap to allow QUEUES action to work
          6. Added a new test CdcrBootstrapTest#testBootstrapWithContinousIndexingOnSourceCluster to stress bootstrap during indexing load
          7. All existing tests pass and precommit passes

          The current patch implements the goal of this ticket, which is to fall back to whole-index replication when tlogs are insufficient. Therefore, this patch does not remove CdcrUpdateLog and related functionality, which can be a bit complicated as Renaud had pointed out. This patch also does not allow updates to be sent while a bootstrap is in progress. Doing that opens a can of worms because you need to carefully coordinate with the leader the order of hard commit and the start of buffering to avoid losing documents. Unless the source cluster has very high update rates, the replicator thread should be able to catch up even without this head start.

          I plan to commit this patch as-is and open follow up issues for refactoring and other improvements.

          shalinmangar Shalin Shekhar Mangar added a comment -
          • Uses the CdcrReplicatorState's httpclient for all bootstrap-related activities, i.e. submission, status, and cancellation
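          Routing submission, status, and cancellation through the replicator state's single client keeps all bootstrap traffic on one connection pool. A minimal sketch of the submit-then-poll flow, using a hypothetical `CdcrClient` interface and status strings (the real commands are the asynchronous BOOTSTRAP and BOOTSTRAP_STATUS actions on the CdcrRequestHandler):

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical abstraction over the CdcrReplicatorState's http client.
interface CdcrClient {
    void submitBootstrap();    // would issue action=BOOTSTRAP
    String bootstrapStatus();  // would issue action=BOOTSTRAP_STATUS
}

public class BootstrapPoller {
    /** Submit a bootstrap and poll until it finishes; true on success. */
    static boolean bootstrapAndWait(CdcrClient client, int maxPolls) {
        client.submitBootstrap();
        for (int i = 0; i < maxPolls; i++) {
            String status = client.bootstrapStatus();
            if ("completed".equals(status)) return true;
            if ("failed".equals(status)) return false;
            // "running": keep polling (a real impl sleeps between polls)
        }
        return false; // gave up after maxPolls attempts
    }

    public static void main(String[] args) {
        // Fake client that reports "running" twice, then "completed".
        Iterator<String> statuses =
            List.of("running", "running", "completed").iterator();
        CdcrClient fake = new CdcrClient() {
            public void submitBootstrap() { /* no-op for the fake */ }
            public String bootstrapStatus() { return statuses.next(); }
        };
        System.out.println(bootstrapAndWait(fake, 10) ? "completed" : "failed");
    }
}
```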
          jira-bot ASF subversion and git services added a comment -

          Commit 153c2700450af1e1c4bd063d7d8b65cc4a726438 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=153c270 ]

          SOLR-6465: CDCR: fall back to whole-index replication when tlogs are insufficient

          jira-bot ASF subversion and git services added a comment -

          Commit cc3f3e8a8b37bba8c465beded466ba95e3c4a77d in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cc3f3e8 ]

          SOLR-6465: CDCR: fall back to whole-index replication when tlogs are insufficient

          (cherry picked from commit 153c2700450af1e1c4bd063d7d8b65cc4a726438)

          steve_rowe Steve Rowe added a comment -

          My Jenkins had a CdcrBootstrapTest.testBootstrapWithContinousIndexingOnSourceCluster() failure on branch_6x that does not reproduce for me (I tried the repro line both with and without tests.method) - I'm attaching a compressed excerpt from the build log for that run (1858.log.gz):

            [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=CdcrBootstrapTest -Dtests.method=testBootstrapWithContinousIndexingOnSourceCluster -Dtests.seed=B8FE08D75B76C10A -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt -Dtests.locale=uk -Dtests.timezone=US/Aleutian -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
             [junit4] FAILURE  145s J7  | CdcrBootstrapTest.testBootstrapWithContinousIndexingOnSourceCluster <<<
             [junit4]    > Throwable #1: java.lang.AssertionError: Document mismatch on target after sync expected:<20000> but was:<0>
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([B8FE08D75B76C10A:6CBB438EBC2072F1]:0)
             [junit4]    > 	at org.apache.solr.cloud.CdcrBootstrapTest.testBootstrapWithContinousIndexingOnSourceCluster(CdcrBootstrapTest.java:334)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
          
          shalinmangar Shalin Shekhar Mangar added a comment -

          Thanks Steve, I'll take a look.

          steve_rowe Steve Rowe added a comment -

          My Jenkins saw a different CdcrBootstrapTest failure on branch_6x: testConvertClusterToCdcrAndBootstrap() - again, it doesn't reproduce for me, regardless of whether tests.method was included on the command line - I'm attaching a compressed excerpt from the build log for that run (1890.log.gz):

             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=CdcrBootstrapTest -Dtests.method=testConvertClusterToCdcrAndBootstrap -Dtests.seed=C8BABD95B571310C -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt -Dtests.locale=th -Dtests.timezone=America/Argentina/Cordoba -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
             [junit4] FAILURE  138s J6  | CdcrBootstrapTest.testConvertClusterToCdcrAndBootstrap <<<
             [junit4]    > Throwable #1: java.lang.AssertionError: Document mismatch on target after sync expected:<10000> but was:<0>
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([C8BABD95B571310C:1F6D92E2012EA94B]:0)
             [junit4]    > 	at org.apache.solr.cloud.CdcrBootstrapTest.testConvertClusterToCdcrAndBootstrap(CdcrBootstrapTest.java:144)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
          

            People

            • Assignee: shalinmangar Shalin Shekhar Mangar
            • Reporter: yseeley@gmail.com Yonik Seeley
            • Votes: 1
            • Watchers: 12