SOLR-11069

CDCR bootstrapping can get into an infinite loop when a core is reloaded

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.2, 6.3, 6.4, 6.5, 6.6, 7.0
    • Fix Version/s: 6.6.1, 6.7, 7.0, 7.1, master (8.0)
    • Component/s: CDCR
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      The LASTPROCESSEDVERSION (abbreviated LPV) action for CDCR breaks down due to a poorly initialised and maintained buffer log reader on the core nodes of the source or target cluster.

      If buffering is enabled for the cores of either the source or the target cluster, the action returns -1, irrespective of the number of tlog entries read by the leader node of each shard of the respective collection. Once buffering is disabled, it starts reporting the correct LPV for each core.

      Due to the same flawed behavior, the update log synchronizer may not work as expected, i.e. it provides an incorrect seek position for the non-leader nodes to advance to. I am not sure whether this is intended behavior for the sync, but it surely doesn't feel right.
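
      As a quick reproduction sketch (hedged: the core URL and core name below are placeholders, and the "lastProcessedVersion" response key is an assumption based on the handler's LAST_PROCESSED_VERSION constant), a SolrJ probe along these lines shows the symptom:

      import org.apache.solr.client.solrj.SolrClient;
      import org.apache.solr.client.solrj.SolrRequest;
      import org.apache.solr.client.solrj.impl.HttpSolrClient;
      import org.apache.solr.client.solrj.request.GenericSolrRequest;
      import org.apache.solr.common.params.ModifiableSolrParams;
      import org.apache.solr.common.util.NamedList;

      public class LpvProbe {
        public static void main(String[] args) throws Exception {
          // Hypothetical core URL; point it at a shard leader core of the source collection.
          String coreUrl = "http://localhost:8983/solr/source_shard1_replica1";
          try (SolrClient client = new HttpSolrClient.Builder(coreUrl).build()) {
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("action", "LASTPROCESSEDVERSION");
            NamedList<Object> rsp = client.request(
                new GenericSolrRequest(SolrRequest.METHOD.GET, "/cdcr", params));
            // With buffering enabled this prints -1 regardless of tlog contents;
            // after an action=DISABLEBUFFER call it reports a real version.
            System.out.println("lastProcessedVersion = " + rsp.get("lastProcessedVersion"));
          }
        }
      }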

      Attachments

      1. SOLR-11069.patch
        51 kB
        Erick Erickson
      2. SOLR-11069.patch
        51 kB
        Erick Erickson
      3. SOLR-11069.patch
        8 kB
        Erick Erickson

        Activity

        erickerickson Erick Erickson added a comment -

        Shalin Shekhar Mangar Renaud Delbru Any comments?

        sarkaramrit2@gmail.com Amrit Sarkar added a comment -

        So when we enable buffering in CDCR, bufferToggle gets initialised via newLogReader(), where:

         return new CdcrLogReader(new ArrayList(logs), tlog);
        
            private CdcrLogReader(List<TransactionLog> tlogs, TransactionLog tlog) {
              this.tlogs = new LinkedBlockingDeque<>();
              this.tlogs.addAll(tlogs);
              if (tlog != null) this.tlogs.push(tlog); // ensure that the tlog being written is pushed
        
              // Register the pointer in the parent UpdateLog
              pointer = new CdcrLogPointer();
              logPointers.put(this, pointer);
        
              // If the reader is initialised while the updates log is empty, do nothing
              if ((currentTlog = this.tlogs.peekLast()) != null) {
                tlogReader = currentTlog.getReader(0);
                pointer.set(currentTlog.tlogFile);
                numRecordsReadInCurrentTlog = 0;
                log.debug("Init new tlog reader for {} - tlogReader = {}", currentTlog.tlogFile, tlogReader);
              }
            }
        

        lastVersion and nextToLastVersion are initialised to -1 and never changed afterwards. The recent logs are still added to tlogs and the current tlog is maintained, though.
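
        For context, the relevant fields and accessor in CdcrUpdateLog.CdcrLogReader look roughly like this (a paraphrased sketch, not the complete class):

        // lastVersion is only advanced inside next(); the buffer toggle reader is
        // never driven through next(), so the sentinel value is all LPV ever sees.
        private long lastVersion = -1;
        private long nextToLastVersion = -1;

        public long getLastVersion() {
          return lastVersion;
        }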

        Now LPV is calculated in CdcrRequestHandler::handleLastProcessedVersionAction:

            for (CdcrReplicatorState state : replicatorManager.getReplicatorStates()) {
              long version = Long.MAX_VALUE;
              if (state.getLogReader() != null) {
                version = state.getLogReader().getLastVersion();
              }
              lastProcessedVersion = Math.min(lastProcessedVersion, version);
            }
        
            // next check the log reader of the buffer
            CdcrUpdateLog.CdcrLogReader bufferLogReader = ((CdcrUpdateLog) core.getUpdateHandler().getUpdateLog()).getBufferToggle();
            if (bufferLogReader != null) {
              lastProcessedVersion = Math.min(lastProcessedVersion, bufferLogReader.getLastVersion());
            }
        

        bufferLogReader.getLastVersion() evaluates to -1, so LPV outputs -1.
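
        A toy illustration of why that one sentinel poisons the aggregation (the concrete version number is made up):

        long lastProcessedVersion = Long.MAX_VALUE;
        long replicatorReaderVersion = 1573942312821522432L; // hypothetical real tlog version
        long bufferReaderVersion = -1L;                      // buffer toggle, never advanced
        lastProcessedVersion = Math.min(lastProcessedVersion, replicatorReaderVersion);
        lastProcessedVersion = Math.min(lastProcessedVersion, bufferReaderVersion);
        // lastProcessedVersion is now -1, whatever the replicator reader reported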

        sarkaramrit2@gmail.com Amrit Sarkar added a comment - - edited

        Regarding updateLogSynchronizer:

        Every time we call DISABLEBUFFER or ENABLEBUFFER, CdcrBufferManager::stateUpdate is invoked:

        @Override
        public synchronized void stateUpdate() {
          CdcrUpdateLog ulog = (CdcrUpdateLog) core.getUpdateHandler().getUpdateLog();
          // If I am not the leader, I should always buffer my updates
          if (!leaderStateManager.amILeader()) {
            ulog.enableBuffer();
            return;
          }
          // If I am the leader, I should buffer my updates only if buffer is enabled
          else if (bufferStateManager.getState().equals(CdcrParams.BufferState.ENABLED)) {
            ulog.enableBuffer();
            return;
          }
          // otherwise, disable the buffer
          ulog.disableBuffer();
        }
        

        Non-leader nodes always have buffering enabled by default:

        if (!leaderStateManager.amILeader()) {
          ulog.enableBuffer();
          return;
        }
        

        Though LPV is always calculated on the leader, this has serious drawbacks, explained below:

        In CdcrUpdateLogSynchronizer::run, if buffering is enabled:

        // if we received -1, it means that the log reader on the leader has not yet started to read log entries
        // do nothing
        if (lastVersion == -1) {
          return;
        }
        try {
          CdcrUpdateLog ulog = (CdcrUpdateLog) core.getUpdateHandler().getUpdateLog();
          if (ulog.isBuffering()) {
            log.debug("Advancing replica buffering tlog reader to {} @ {}:{}", lastVersion, collection, shardId);
            ulog.getBufferToggle().seek(lastVersion);
          }
        }
        

        It always returns when lastVersion == -1, and the comment "if we received -1, it means that the log reader on the leader has not yet started to read log entries" is misleading.

        Since lastVersion is never positive, the seek for the corresponding non-leader nodes is never advanced to the appropriate LPV.

        Now if the leader goes down and a non-leader becomes the leader itself, the LPV is not set properly, resulting in improper sync; I have no idea what the impact will be in that case.

        Also, since buffering is always on for non-leader nodes, if one of them later becomes the leader, its buffer status and action will remain enabled even if we have disabled the buffer for the source collection's cluster. Again, not sure of the impact; this needs a closer look.

        sarkaramrit2@gmail.com Amrit Sarkar added a comment -

        Continuing with how LPV was never robustly tested:

        The only place LPV is mentioned in the tests is CdcrRequestHandlerTest:

            // replication never started, lastProcessedVersion should be -1 for both shards
            rsp = invokeCdcrAction(shardToLeaderJetty.get(SOURCE_COLLECTION).get(SHARD1), CdcrParams.CdcrAction.LASTPROCESSEDVERSION);
            long lastVersion = (Long) rsp.get(CdcrParams.LAST_PROCESSED_VERSION);
            assertEquals(-1l, lastVersion);
        
            rsp = invokeCdcrAction(shardToLeaderJetty.get(SOURCE_COLLECTION).get(SHARD2), CdcrParams.CdcrAction.LASTPROCESSEDVERSION);
            lastVersion = (Long) rsp.get(CdcrParams.LAST_PROCESSED_VERSION);
            assertEquals(-1l, lastVersion);
        

        What value LPV takes when it is > -1 (it should be at least > 1) once the leader has read some entries from the tlogs is never tested anywhere, or at least I cannot find it.
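
        A hypothetical follow-on assertion, modeled on the helpers visible in the snippet above (the indexing helper and exact setup are assumptions), would cover that gap:

        // index a document and let the leader's log reader make progress first
        index(SOURCE_COLLECTION, getDoc(id, "doc-1"));
        rsp = invokeCdcrAction(shardToLeaderJetty.get(SOURCE_COLLECTION).get(SHARD1),
            CdcrParams.CdcrAction.LASTPROCESSEDVERSION);
        lastVersion = (Long) rsp.get(CdcrParams.LAST_PROCESSED_VERSION);
        assertTrue("LPV should advance past -1 once tlog entries are read", lastVersion > 0);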

        varunthacker Varun Thacker added a comment -

        I just saw a case where restarting the source cluster triggered a bootstrap. Since LASTPROCESSEDVERSION was -1, the source ended up bootstrapping the target. Disabling the buffer on the source makes the pointer move correctly.

        sarkaramrit2@gmail.com Amrit Sarkar added a comment -

        I just saw a case where restarting the source cluster triggered bootstrap.

        When the leader of a shard of the source collection goes down and a non-leader is elected, a bootstrap is triggered for the reason stated above: LPV is set to -1.

        erickerickson Erick Erickson added a comment -

        I'm dithering back and forth about this. I suspect that we're conflating a couple of issues. There's definitely a problem with bootstrapping (I'll attach a patch in a minute). It may well be that LASTPROCESSEDVERSION is not actually a problem; at least in some testing (with the attached patch), the fact that it is -1 when buffering is enabled seems to be OK.

        I propose we use the patch as a starting point to see if this LASTPROCESSEDVERSION is a problem or not.

        1> when buffering is enabled, tlogs will accrue forever according to the original intent. From Renaud:

        The original goal of the buffer in CDCR is indeed to keep the tlogs indefinitely until the buffer is deactivated (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462#CrossDataCenterReplication(CDCR)-TheBufferElement). This was useful, for example, during maintenance operations, to ensure that the source cluster keeps all the tlogs until the target cluster is properly initialised. In this scenario, one activates the buffer on the source. The source then stores all the tlogs (and does not purge them). Once the target cluster is initialised and has registered a tlog pointer on the source, one can deactivate the buffer on the source, and the tlogs start to be purged once they are read by the target cluster.

        But additionally he had this to say:
        Regarding the issue of LPV = -1, I am a bit surprised, as this sentinel value should be used only when the source cluster does not have any log pointers, i.e., no target clusters were configured and initialised with this source cluster. In that case it indicates that there is no registered log reader, and that we should not remove any tlogs if the buffer is enabled (as we have to wait for the target to register a log reader and log pointer).

        And enabling buffering definitely causes LASTPROCESSEDVERSION to return -1. However, with the patch LPV immediately goes back to a reasonable value as soon as buffering is disabled, the tlogs get cleaned up etc. without bootstrapping. So I do wonder if the -1 value is just overloaded in this case to also mean "don't purge tlogs".

        We need to untangle a couple of things. I'll attach a patch in a few minutes that might help.

        erickerickson Erick Erickson added a comment -

        Figuring out the LPV issue is hard because bootstrapping had a problem: at the end of the process, the core is reloaded. However, that means the code that checks on the state of the replication returns a "notfound", which causes another bootstrap command to be sent.

        So this patch moves the relevant objects to (Default)SolrCoreState where they're preserved around core reloads. With this patch (PoC) I can get bootstrapping to occur, enable/disable buffering, bring the target up and down etc. The fact that LPV is -1 when buffering is enabled doesn't seem to be a problem.
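
        For anyone skimming, the shape of the change is roughly this (a hedged sketch of the idea, not the literal patch; the field and accessor names are illustrative assumptions):

        import java.util.concurrent.Future;

        // SolrCoreState outlives core RELOADs, so bootstrap bookkeeping kept here
        // is not reset when the core comes back, and the status check no longer
        // reports "notfound" and re-triggers a bootstrap.
        public abstract class SolrCoreState {
          private volatile Boolean cdcrBootstrapRunning = false;
          private volatile Future<Boolean> cdcrBootstrapFuture;

          public Boolean getCdcrBootstrapRunning() { return cdcrBootstrapRunning; }
          public void setCdcrBootstrapRunning(Boolean running) { cdcrBootstrapRunning = running; }

          public Future<Boolean> getCdcrBootstrapFuture() { return cdcrBootstrapFuture; }
          public void setCdcrBootstrapFuture(Future<Boolean> future) { cdcrBootstrapFuture = future; }
        }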

        So if others can give this a whirl and see if their testing is OK with it then maybe the LPV issue is not an issue.

        Mostly I'm throwing this out for others to consider. What do people think about putting the additional objects in SolrCoreState? Putting the objects there was quick, I'm interested in seeing if my results work for others. If so we can decide whether this is the right way to go.

        Haven't run precommit, haven't run the full test suite. Did run CdcrBootstrapTest. Also, the CDCR docs need to be updated.

        sarkaramrit2@gmail.com Amrit Sarkar added a comment - - edited

        Thank you Erick for clarifying the root cause. I see LPV may very well not be the issue we are facing here; pardon my limited testing on this.

        Three things I tested on a limited schedule to see whether the issues are addressed with Erick's patch on branch_6x:

        1. Restart the source and target clusters at different intervals and see if a bootstrap happens.
        2. On 2x2 source and target collections/clusters, shut down the leader node so that a follower becomes the leader, and see if a bootstrap happens.
        3. Observe the behaviour of the tlogs across all cores in both the source and target collections.

        1. Restart the source and target clusters and see if a bootstrap happens.

        No bootstrap except the obvious one, when it's actually required. The combinations I tested:
        1. CDCR stop, buffer enable, index X documents and then CDCR on, multiple restarts
        2. CDCR stop, buffer disable, index X documents and then CDCR on, multiple restarts
        3. CDCR stop, buffer enable, index X documents and then CDCR on, buffer enable, multiple restarts
        4. CDCR stop, buffer disable, index X documents and then CDCR on, buffer disable, multiple restarts
        5. The above 4 steps one after another on newly created source and target collections/clusters.

        The expected behavior is observed: a bootstrap when CDCR is turned on.

        2. On 2x2 source and target collections/clusters, shut down the leader node so that a follower becomes the leader, and see if a bootstrap happens.

        No bootstrap except the obvious one, when it's actually required. The combinations I tested:
        1. CDCR stop, buffer enable, index X documents and then CDCR on, shut down the leader node
        2. CDCR stop, buffer disable, index X documents and then CDCR on, shut down the leader node
        3. CDCR stop, buffer enable, index X documents and then CDCR on, buffer enable, shut down the leader node
        4. CDCR stop, buffer disable, index X documents and then CDCR on, buffer disable, shut down the leader node
        5. The above 4 steps one after another on newly created source and target collections/clusters.

        The expected behavior is observed: a bootstrap when CDCR is turned on. COLLECTIONCHECKPOINT and LASTPROCESSEDVERSION are successfully transferred to / picked up by the newly elected leader.

        3. Observe the behaviour of the tlogs across all cores in both the source and target collections.

        This was peculiar; as stated by Erick in an offline discussion, I had the same observations:
        a) When buffering is enabled, all the tlogs are kept on disk forever.
        b) Once we disable it, while no indexing is taking place, everything remains as it is.
        c) When a single document is indexed after that, the old tlogs get purged; it doesn't keep ONLY 10 tlogs as expected, but more, and the count gradually decreases as we index along.
        d) At times only 1-2 tlogs are present in each core of the source collection (as observed by Erick too) when we stop indexing altogether or index slowly. Not sure of the reason; I didn't have a chance to look into it, but I speculate there is no need to keep a definite number like 10 or N, only to keep a tab on the last processed tlog version, which can be the 2nd, 10th or i-th, depending.

        erickerickson Erick Erickson added a comment -

        Thanks for testing! So, net-net: with the exception of the tlog purging being a little confusing, the patch seems to fix CDCR?

        On a relatively brief inspection of the code, the 10-tlog bit is unimportant. The loop in CdcrUpdateLog.addOldLog removes old logs if and only if nothing is pointing to them. In fact I don't really see the reason for even testing the count, assuming that the "if (!this.hasLogPointer(log)) {" line preserves the tlogs necessary for CDCR.
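
        For illustration, that retention loop looks roughly like this (a paraphrase with assumed names, not the exact source):

        // Trailing tlogs are purged only while no registered CdcrLogPointer still
        // references the oldest one, which is why the nominal tlog count is soft.
        while (logs.size() > numRecordsToKeep) {
          TransactionLog oldest = logs.peekLast();
          if (!this.hasLogPointer(oldest)) {
            logs.removeLast().deleteOnClose = true; // nothing points at it; safe to drop
          } else {
            break; // a CDCR reader still needs this tlog and everything newer
          }
        }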

        I'm not sure we need to fix on this ticket the fact that tlogs aren't getting purged quite the way we'd expect; perhaps raise another one? Especially if this behavior is also present on 6.1, which I believe it is. CDCR is pretty broken with the infinite bootstrapping, but only a little confusing with the tlog retention.

        erickerickson Erick Erickson added a comment -

        This one against master.

        Fixes precommit and has documentation changes.

        All tests pass.

        I need to go over the doc changes again, but this is what it's looking like at this point. The major changes are an admonition about buffering and some explanation about what it's for.

        erickerickson Erick Erickson added a comment -

        Just asciidoc changes; somehow a bunch of my edits yesterday got lost.

        shalinmangar Shalin Shekhar Mangar added a comment -

        Looks good to me, Erick! Thanks for fixing this.

        jira-bot ASF subversion and git services added a comment -

        Commit ac97931c7e5800b2e314545f54c4d524eb69b73b in lucene-solr's branch refs/heads/master from Erick Erickson
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ac97931 ]

        SOLR-11069: CDCR bootstrapping can get into an infinite loop when a core is reloaded

        jira-bot ASF subversion and git services added a comment -

        Commit a850749ab32e57d0bd96a8517798febeaad9dec1 in lucene-solr's branch refs/heads/branch_7x from Erick Erickson
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a850749 ]

        SOLR-11069: CDCR bootstrapping can get into an infinite loop when a core is reloaded

        (cherry picked from commit ac97931)

        jira-bot ASF subversion and git services added a comment -

        Commit 34139f7deb698611046263503272267179c0d315 in lucene-solr's branch refs/heads/branch_7_0 from Erick Erickson
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=34139f7 ]

        SOLR-11069: CDCR bootstrapping can get into an infinite loop when a core is reloaded

        (cherry picked from commit ac97931c7e5800b2e314545f54c4d524eb69b73b)

        jira-bot ASF subversion and git services added a comment -

        Commit e2477ecce2503f7c4f69ac1966c49691a3c977b8 in lucene-solr's branch refs/heads/branch_6x from Erick Erickson
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e2477ec ]

        SOLR-11069: CDCR bootstrapping can get into an infinite loop when a core is reloaded

        (cherry picked from commit ac97931c7e5800b2e314545f54c4d524eb69b73b)

        jira-bot ASF subversion and git services added a comment -

        Commit c7f9fcea4b4455c921987e4447b68cdbe046e2f6 in lucene-solr's branch refs/heads/branch_6_6 from Erick Erickson
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c7f9fce ]

        SOLR-11069: CDCR bootstrapping can get into an infinite loop when a core is reloaded

        (cherry picked from commit ac97931c7e5800b2e314545f54c4d524eb69b73b)

        shalinmangar Shalin Shekhar Mangar added a comment -

        Bulk close after 7.1.0 release


          People

          • Assignee: erickerickson Erick Erickson
          • Reporter: sarkaramrit2@gmail.com Amrit Sarkar
          • Votes: 1
          • Watchers: 6
