Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: None
    • Labels: None

      Description

      This is the master issue for Cross Data Center Replication (CDCR)
      described at a high level here: http://yonik.com/solr-cross-data-center-replication/

      Attachments

      1. forShalin.patch (36 kB, Erick Erickson)
      2. SOLR-6273.patch (312 kB, Renaud Delbru)
      3. SOLR-6273.patch (314 kB, Renaud Delbru)
      4. SOLR-6273.patch (298 kB, Greg Solovyev)
      5. SOLR-6273.patch (298 kB, Renaud Delbru)
      6. SOLR-6273-5x-rollup.patch (317 kB, Erick Erickson)
      7. SOLR-6273-plus-8263-5x.patch (336 kB, Dennis Gove)
      8. SOLR-6273-plus-8263-5x.patch (340 kB, Erick Erickson)
      9. SOLR-6273-trunk.patch (321 kB, Erick Erickson)
      10. SOLR-6273-trunk.patch (320 kB, Erick Erickson)
      11. SOLR-6273-trunk-testfix1.patch (2 kB, Renaud Delbru)
      12. SOLR-6273-trunk-testfix2.patch (15 kB, Renaud Delbru)
      13. SOLR-6273-trunk-testfix3.patch (25 kB, Erick Erickson)
      14. SOLR-6273-trunk-testfix6.patch (37 kB, Renaud Delbru)
      15. SOLR-6273-trunk-testfix7.patch (46 kB, Erick Erickson)

        Issue Links

          Activity

          arcadius Arcadius Ahouansou added a comment -

          Thanks Yonik Seeley for the detailed blog entry.
          This issue looks very similar to SOLR-6205

          yseeley@gmail.com Yonik Seeley added a comment -

          This issue looks very similar to SOLR-6205

          Not really... other than they both have "Data Center" in the title.
          SOLR-6205 looks like it is about location awareness (rack, zone, DC, etc) and is a good thing to have independent of this issue.

          yseeley@gmail.com Yonik Seeley added a comment -

          Note: some of us will be collaborating on a github branch here:
          https://github.com/Heliosearch/lucene-solr/tree/solr6273

          Let me know (privately to keep down noise) if you want to help out and want write access to that repo.

          rendel Renaud Delbru added a comment - - edited

          The initial patch for CDCR against trunk. It contains a working version of cross data center replication for active-passive scenarios. The CdcrRequestHandler provides an API to control and monitor the replication. Documentation on how to configure CDCR and on the API can be found here.
          This patch includes the following patches: SOLR-6621, SOLR-6819, SOLR-6823, plus a few minor modifications to the UpdateLog and TransactionLog classes. Other than that, the rest of the CDCR code simply extends the Solr Core code.
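
          For reference, here is a minimal sketch of how such a handler might be registered in solrconfig.xml. The replica block mirrors the one quoted later in this thread; the handler path and the replicator parameter values are illustrative placeholders, not taken verbatim from the patch:

          <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
            <!-- target cluster and the source/target collection pair to replicate -->
            <lst name="replica">
              <str name="zkHost">${zkHost}</str>
              <str name="source">source_collection</str>
              <str name="target">target_collection</str>
            </lst>
            <!-- replicator scheduling and batching (illustrative values) -->
            <lst name="replicator">
              <str name="schedule">1000</str>
              <str name="batchSize">64</str>
            </lst>
          </requestHandler>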

          shalinmangar Shalin Shekhar Mangar added a comment -

          This is a very important feature. Thanks for all your work! I intend to start reviewing this in detail next week.

          joel.bernstein Joel Bernstein added a comment - - edited

          Renaud Delbru, this looks awesome! I've had a chance to review the patch and do a fair amount of manual testing. So far, during manual testing CDCR is working as designed. The unit tests look strong.

          Shalin Shekhar Mangar, looking forward to hearing your thoughts on the patch when you've had a chance to review.

          I'd be happy to move this forward towards committal unless another committer would like this assignment.

          I think it makes sense to commit this to trunk and then spend some time refining this before backporting to 5x.

          I see two things we'll need to tackle before committing to trunk:

          1) The CdcReplicationDistributedZkTest takes about 7 minutes to run on my computer. We'll need to come up with a strategy for shortening this for normal test runs. If anyone knows any tricks to make this test run faster please chime in. If we can't make it run faster we can move parts of the test to @nightly.

          2) We need to make sure that the default CDCR startup state for trunk doesn't cause any issues. I'll do some manual testing and see if I see any issues here.

          grishick Greg Solovyev added a comment -

          I am working on applying this patch to test cross data center replication. Unless I am misunderstanding the code, this patch assumes that replication for each collection is configured in solrconfig.xml. That is, without this section in a collection's solrconfig.xml file, the collection won't get replicated:

          <lst name="replica">
            <str name="zkHost">${zkHost}</str>
            <str name="source">source_collection</str>
            <str name="target">target_collection</str>
          </lst>

          This means that CDCR won't work for collections created via the Collections API with a shared configset: all such collections have an identical solrconfig.xml, so there is no way to override the source and target parameters for each collection.

          grishick Greg Solovyev added a comment -

          This patch expands the previously added patch to add the following features:

          • if source_collection is not defined - use the collection name associated with the Core
          • if target_collection is not defined - use the same name as source_collection
          • if target collection does not exist on the target cloud - provision it with the same parameters as the source collection
          grishick Greg Solovyev added a comment -

          P.S. The last patch is made off of the 4.10.2 tag.

          rendel Renaud Delbru added a comment -

          A new version of the patch. The patch has been created from the latest branch_5x. The full Solr test suite has been executed successfully (there were a few timeouts in some of the tests, but these seem unrelated to this patch). The principal change in this new version is a fix for the replication of tlog files. The ReplicationHandler and IndexFetcher have been modified to replicate tlog files during a recovery (only if CDCR is activated). Some unit tests covering various scenarios can be found
          in core/src/test/org/apache/solr/cloud/CdcrReplicationHandlerTest.java.
          In addition to the suite of automated unit tests, this version has been tested in various real deployments. One client has extensively tested the robustness and performance of CDCR in pre-prod, and is satisfied with the results.

          We think that the code is in a relatively good state to be pushed to Solr. How can we move forward from here?

          erickerickson Erick Erickson added a comment -

          Renaud:

          Which of the sub-tasks are still open? Should we create a different JIRA for "CDCR enhancements" or some such and deal with the sub-tasks there? Mostly I'm thinking about how to close this JIRA at checkin if/when.

          All:

          This is a major patch that adds much-needed functionality to Solr, something that we haven't had a really good answer for in the past. But it's...er...big. After we get consensus, I expect we'd want to check this into trunk and let it bake for a while (how long?) before merging into 5x.

          I think we really need some eyes on this....

          grishick Greg Solovyev added a comment -

          Frankly, we would not be able to use this feature without auto-provisioning of collections (the feature that I added in my version of the patch). I cannot tell from the subtasks whether this feature is part of any of them.

          janhoy Jan Høydahl added a comment -

          Great stuff, can we get this into trunk as experimental/beta for wider exposure to the real world, before stabilizing the APIs?

          I notice that slice is being used instead of shard in the patch. I thought we decided to use shard in all user facing APIs and docs?

          rendel Renaud Delbru added a comment -

          Hi,

          Erick Erickson: From the original subtasks, the ones that are not covered with this patch are: SOLR-6465 and SOLR-6466.

          Greg Solovyev: The current patch does not cover the auto-provisioning of collections / live configuration of peer clusters. I think this issue should be tackled as part of SOLR-6466.

          Jan Høydahl: Could you point to where slice is being used instead of shard? It should not be a problem to change that.

          janhoy Jan Høydahl added a comment -

          Jan Høydahl: Could you point to where slice is being used instead of shard ? This should not be a problem to change that.

          curl -s "https://issues.apache.org/jira/secure/attachment/12725545/SOLR-6273.patch" |grep -n slice
          
          rendel Renaud Delbru added a comment -

          Here is a new patch with the following changes:

          • Renamed 'slice' to 'shard'
          • Removed an optimisation in the replication of tlog files which could lead to duplicate tlog entries on a slave node. We were trying to avoid transferring tlog files that were already present on the slave nodes in order to reduce network transfer. However, tlog files between the master and slave can differ, overlap, etc., making the comparison difficult to achieve. With this optimisation removed, during a recovery the tlog replication now transfers all the tlog files from the master to the slave, and replaces all the existing tlog files on the slave node.
          erickerickson Erick Erickson added a comment -

          What do people think about letting this bake in trunk for a while? If there are no objections I'll probably commit this to trunk in the next few days.

          erickerickson Erick Erickson added a comment -

          OK, there have been no objections to this, so I'm going to commit it to trunk, let it bake for a little while then merge into 5.2. Probably get this done tonight or tomorrow.

          arcadius Arcadius Ahouansou added a comment -

          Note that this is just a question, not an objection.

          The design blog talks only about 2 DCs being required.

          Let's suppose that DC1 and DC2 are both operational and updates are being sent to both of them.
          If the pipe between the two suddenly breaks for a couple of hours while updates keep going into each individual DC,
          then when the link is re-established, updates/replications from the TransactionLog will fly from both DCs, i.e. DC1->DC2 and DC2->DC1.

          • How do we guarantee the order of execution of updates from the TL? E.g., while the link was broken, there was a deletion of doc#1 in DC1 followed by an add/update of the same doc#1 in DC2.
          • In case DC1 lags far behind the other, full index replication (a la master-slave) may happen, meaning all updates done on DC1 will be overwritten by data from DC2, leading to data loss?
          • Would a 3rd DC help make the system more redundant?
          erickerickson Erick Erickson added a comment -

          Here's a patch that applies against trunk, so far it passes precommit but I haven't yet run the full test suite, won't have a chance until tonight.

          Differences from original patch:

          • The "usual" reconciliation issues, a few minor incompatibilities with code that's changed in trunk.
          • Replaced CloudSolrServer with CloudSolrClient (etc.).
          • Fixed forbidden APIs that failed precommit.
          • Some files were prefixed by Cdc, others by Cdcr, so I made them all Cdcr for consistency's sake.
          • Formatted everything that had changed. NOTE: I tried the nifty "only vcs changed" option in IntelliJ and it seemed to work fine. If anyone sees gratuitous formatting changes, let me know. By far the majority of the code is new though.

          Assuming the tests run OK, I intend to commit this later this evening or perhaps tomorrow evening unless there are objections. I'll commit this to trunk, let it bake for a while then merge back into 5x.

          erickerickson Erick Erickson added a comment -

          Renaud Delbru Things are looking pretty good. The only thing that isn't working when I run "ant test" on trunk is it complains that AbstractCdcrDistributedZkTest should be a concrete class since it contains the @Test annotation. Should
          public void testDistribSearch()

          just have the @Test removed? (It's late or I'd look some more).

          If it's just that simple, let me know and I'll fix it up in the trunk patch and it'll be merged into 5x along with the rest of my changes.

          Thanks!
          Erick

          erickerickson Erick Erickson added a comment -

          Removing the @Test in this case seems to be fine.

          Got through nightly OK.

          Speaking of which, these are pretty long tests. Should they be annotated with @Nightly? What do people think? Or perhaps left the way they are for baking then made Nightly later?

          jkot Jakub Kotowski added a comment -

          Renaud is travelling this week so he might not be able to respond. I think that removing @Test is ok. That said, Renaud had to dig deep into the test framework to make things work, so it would be better if he confirms when he's back. I can't comment on @Nightly as I don't know your process.

          erickerickson Erick Erickson added a comment -

          Arcadius Ahouansou Sorry it took a while to get back to you, but currently CDCR is active-passive, not active-active so the scenario you asked about shouldn't arise.

          erickerickson Erick Erickson added a comment -

          Updated trunk patch (the original patch was against 4x). In addition to the last trunk patch I uploaded on 10-May, this one tries to resolve the test issues. You'll see some //nocommit and //EOE comments so it won't pass precommit. These are just markers to allow others to review the test changes, I'll remove them before committing as well as add a new CHANGES.txt entry.

          Fortunately, the original patch against 4x was mostly new code so there were very few places that needed to be reconciled.

          rendel Renaud Delbru added a comment -

          Erick Erickson I have checked the new patch on the latest trunk. The unit tests seem to properly run with the latest changes. Thanks for porting this to trunk.

          erickerickson Erick Erickson added a comment -

          Renaud Delbru Thanks for looking it over. OK, I'll clean up the nocommits etc and check it in probably tomorrow.

          jira-bot ASF subversion and git services added a comment -

          Commit 1681186 from Erick Erickson in branch 'dev/trunk'
          [ https://svn.apache.org/r1681186 ]

          SOLR-6273: Cross Data Center Replication

          erickerickson Erick Erickson added a comment -

          I'm going to let this bake on trunk for a week or so, then merge into 5.3.

          Thanks Renaud, Yonik, et al.!

          erickerickson Erick Erickson added a comment -

          Rats, coffee hasn't kicked in yet. Mis-typed the JIRA, here's what the comment should have been on the latest commit:

          "SOLR-6273: Cross Data Center Replication disabling noisy tests until we figure it out"

          Revision is: [ https://svn.apache.org/r1681361 ]

          rendel Renaud Delbru added a comment -

          Erick Erickson, I was able to reproduce the issues from the failed Jenkins build. After replicating the tlog files, the update log of the slave is not properly "re-initialised", and it still contains references to the previous tlog files. I have attached a fix for this.

          jira-bot ASF subversion and git services added a comment -

          Commit 1681839 from Erick Erickson in branch 'dev/trunk'
          [ https://svn.apache.org/r1681839 ]

          SOLR-6273: Cross Data Center Replication: Fix at least one test, un-Ignore tests

          erickerickson Erick Erickson added a comment -

          Apologies in advance if re-enabling all the tests generates noise. I couldn't get a failure on my box in 150 tries or so, so I'll have to pull logs from Jenkins if/when additional issues spring up.

          jira-bot ASF subversion and git services added a comment -

          Commit 1681893 from Erick Erickson in branch 'dev/trunk'
          [ https://svn.apache.org/r1681893 ]

          SOLR-6273: re-ignoring failed tests

          jira-bot ASF subversion and git services added a comment -

          Commit 1681904 from Erick Erickson in branch 'dev/trunk'
          [ https://svn.apache.org/r1681904 ]

          SOLR-6273: disable more failing tests now that we have logs

          rendel Renaud Delbru added a comment -

          Erick Erickson, I have attached a new patch regarding the unit test failures from the Jenkins job. It is likely that the errors we saw are due to the Jenkins server being under heavy load and therefore less responsive, which might trigger race condition issues in the assertions of the unit tests.
          I have added various safeguard methods to the unit test framework, so that it will wait for the completion of particular tasks (cdcr state replication, update log cleaning, etc.) and fail after a given timeout (15s).

          erickerickson Erick Erickson added a comment -

          Renaud:

          Cool! Yeah, the test cases for this kind of thing are tricky for sure. I'll give it a spin a bit later today and we'll see what Jenkins thinks.

          martin.grotzke Martin Grotzke added a comment -

          Hi all, we're currently evaluating how to expand our current single DC solrcloud to multi (2) DCs. This effort here looks very promising, great work!
          Assuming we'd test how it works for us, could we follow the documentation mentioned above (https://docs.google.com/document/d/1DZHUFM3z9OX171DeGjcLTRI9uULM-NB1KsCSpVL3Zy0/edit?usp=sharing)? Does it match the current implementation? Do you have any other suggestions for us if we'd test this? Thanks!

          rendel Renaud Delbru added a comment -

          Hi Martin,

          The google doc is up to date with the current implementation. One suggestion is for tuning the performance of the replication. The performance of the replication depends on the "Replicator Parameters". In your scenario, the two main parameters will be "schedule" and "batchSize". If you would like to see a very small latency between replication batches, you can decrease the "schedule" parameter from 1000ms to 1ms. To improve the network IO, you can also try to increase the "batchSize" parameter to a larger number (if your documents are a few kbs or less, you can try to increase it to 500, 1000 or more).

          To measure the impact that the parameters have on the replication performance, you can use the monitoring api, e.g., ?action=QUEUES, to retrieve some stats about the replication queue. The queue size will tell you how much your replica lags behind the source cluster. If the replication is not fast enough, you'll see the queue size increasing. The idea is to try to tune the schedule and batchSize parameters until you find the optimal values for your collection and setup, and see this queue being relatively stable and small.
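
          As an illustration (host, port and collection name are placeholders, and this assumes the CDCR handler is registered at /cdcr as in the earlier configuration sketch), the queue stats can be pulled with something like:

          curl "http://source-host:8983/solr/source_collection/cdcr?action=QUEUES"

          The response reports, per target, the number of update log entries still queued for replication and the last processed document timestamp, which are the lag indicators mentioned above.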

          martin.grotzke Martin Grotzke added a comment -

          Great, thanks for the advice, Renaud!

          erickerickson Erick Erickson added a comment -

          All tests pass consistently for me now, I'll be committing this shortly for more baking.

          jira-bot ASF subversion and git services added a comment -

          Commit 1691606 from Erick Erickson in branch 'dev/trunk'
          [ https://svn.apache.org/r1691606 ]

          SOLR-6273: Cross Data Center Replication. All tests are now passing on my machine, let's see if Jenkins flushes anything out

          erickerickson Erick Erickson added a comment -

          Note: These tests take a long time to run. I'm thinking about changing the annotation to "Nightly" after it bakes for a bit, I'll assign a JIRA to myself to track.

          steve_rowe Steve Rowe added a comment -

          Erick Erickson, I got a CdcrReplicationDistributedZkTest failure on my Jenkins: http://jenkins.sarowe.net/job/Lucene-Solr-tests-trunk/806/

          Stack Trace:
          java.lang.AssertionError
          	at __randomizedtesting.SeedInfo.seed([679B796D4028309D:6FFB0C414F261896]:0)
          	at org.junit.Assert.fail(Assert.java:92)
          	at org.junit.Assert.assertTrue(Assert.java:43)
          	at org.junit.Assert.assertTrue(Assert.java:54)
          	at org.apache.solr.cloud.CdcrReplicationDistributedZkTest.doTestTargetCollectionNotAvailable(CdcrReplicationDistributedZkTest.java:138)
          	at org.apache.solr.cloud.CdcrReplicationDistributedZkTest.doTests(CdcrReplicationDistributedZkTest.java:46)
          [...]
          
          erickerickson Erick Erickson added a comment -

          Steve Rowe Got it, thanks! I knew it was too good to be true.

          jira-bot ASF subversion and git services added a comment -

          Commit 1693786 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1693786 ]

          SOLR-6273: Reset test hooks in a finally block to avoid leakage to other tests

          shalinmangar Shalin Shekhar Mangar added a comment -

          Before we release 5.3, we should move this issue out of the 5.3 section and move it to 6.0.0 until it is backported.

          erickerickson Erick Erickson added a comment -

          It's not in the CHANGES.txt file for 5.x at all (just checked, but I can always miss things), just in trunk so maybe this isn't a problem? I'll move it to the proper place before I merge it back to 5x.

          Or am I missing the point?

          shalinmangar Shalin Shekhar Mangar added a comment -

          Oh, I am sorry. This issue is mentioned under 5.3.0 on trunk which got me confused.

          erickerickson Erick Erickson added a comment -

          NP, I'm happy to have as many eyes catching stuff I miss as possible!

          andyetitmoves Ramkumar Aiyengar added a comment - - edited

          Erick Erickson, any plans to move this to branch_5x soon? I am aware that this needs to be bedded in a bit, so no big deal either way, but if you follow my merge for SOLR-7859, you might have to merge in a change to CdcrReplicatorState to avoid failing on forbidden-apis/precommit.

          erickerickson Erick Erickson added a comment -

          Ramkumar Aiyengar Real Soon Now. Which is what I've been thinking for a month or more. I have another version that might make the test behave itself better that I'm going to try to get to today or tomorrow. But I'll just have to deal with the merging issues if and when. Except I'll be merging in the current trunk before then so probably pick those changes up as I go.

          Thanks for the heads-up, but don't delay your stuff at all; this may (continue to) take a while.

          shalinmangar Shalin Shekhar Mangar added a comment -

          I've been playing with this feature for a couple of days and I have a few thoughts to improve this before we merge it into branch_5x.

          1. I think the configuration should be moved out of solrconfig.xml – the source collection name is redundant (it is always the one to which the core belongs) and it is the wrong place to configure peer cluster details. Perhaps the peer cluster details should be in cluster properties and the target collection should live as a collection level property. All this should be editable using our config APIs
          2. I feel it is too complex to have the user configure things like batch sizes and scheduler delays etc. Maybe a better way is to stream the transaction log in a single thread constantly and throttle to a configurable transfer rate. This will also reduce memory requirements by avoiding huge batches and possibly improve transfer speed as well. See point below.
          3. The current CDCR code behaves poorly on bulk loads. I loaded a 600MB file containing 2.7M JSON documents into the source collection in 177 seconds but it took more than 6 hours to replicate them into the target collection using schedule=1ms and batch size = 64. We need to do better than that by default.
          4. Related to the point above, the current CDCR code is not suitable for bootstrapping a new target cluster. We should look into a snapshot replication to speed up the bootstrap process (and maybe even the bulk loads)
          5. We need better stats/reporting including transfer rate, latency etc
          6. Each core puts a watch on the current shard's leader node to figure out if it is the current leader and therefore whether it should start the cdcr threads. I think this is not necessary. A similar problem was faced by SOLR-6266, the Couchbase indexer plugin (not committed yet). I think we should have an event handler API for cores to listen for important cluster state events such as leader changes or state changes, and do away with individual plugins adding a listener on ZK nodes. A better solution may be to have collection level plugins that can be automatically elected, failed over etc., but that is a lot of work so I'll defer it for now.

          Thoughts?

          erickerickson Erick Erickson added a comment -

          thanks for looking! Currently I'm soooo far behind trying to figure out what is with the test framework (I suspect that framework is failing, we've hit some kind of edge case or something) that the notion of next steps is kind of off my radar but we'll certainly look at improvements once the test issues are worked out.

          anshumg Anshum Gupta added a comment - - edited

          Erick Erickson: Should we just move the entry for this from the 5.3 section and into the 6.0 section (even on trunk)? It's kind of confusing as it wasn't released with 5.3.

          jira-bot ASF subversion and git services added a comment -

          Commit 1709619 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1709619 ]

          SOLR-6273: Fixed a null check, some typos and a few compiler warnings

          jira-bot ASF subversion and git services added a comment -

          Commit 1709829 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1709829 ]

          SOLR-6273: Moved entry in change log from Solr 5.3.0 to 6.0

          shalinmangar Shalin Shekhar Mangar added a comment -

          Can someone explain at what point the tlog replication is used?

          Renaud Delbru or Yonik Seeley

          rendel Renaud Delbru added a comment -

          Shalin Shekhar Mangar thanks for looking into this.

          Regarding performance (2 and 3), it is true that the right batch size and scheduler delay are very important for optimal performance. With the proper batch sizes and scheduler delays, we have seen very low update latency between the source and target clusters. In your setup, one document was approximately 0.2 kB in size, so a batch was ~14 kB; at a 1 ms schedule that is up to ~1,000 batches per second, which should correspond to ~14 MB/s of transfer rate. With such a transfer rate, the replication should have been done in a few seconds or minutes, not hours. Could you give more information about your setup / benchmark? Was replication turned off while you were indexing on the source, or did you turn it on afterwards?

          In terms of moving from a batch model to a pure streaming one, this would probably simplify the configuration on the user side, but in terms of performance I am not sure - maybe some other people can give their opinion here. Batches might not use that much memory (if properly configured), and transfer speed should not suffer either (again, if the batch size is properly configured). One way to also simplify the configuration for the user would be, as you proposed, a configurable transfer rate, with some logic to automatically adjust the batch size and scheduler delay based on that rate.

          About 5, I think transfer rate is a good addition. Latency could be computed since the QUEUES monitoring action returns the last document timestamp.

          rendel Renaud Delbru added a comment -

          The tlog replication is only relevant to the source cluster, as it ensures that tlogs are replicated between a master and its slaves in case of a recovery (with a snappull). If not, there are some scenarios where a slave can end up with an incomplete update log, and if it then becomes the master, we will miss some updates and the target cluster becomes inconsistent with respect to the source cluster.

          shalinmangar Shalin Shekhar Mangar added a comment -

          Sorry, you are right. I wasn't using the 1ms delay – I had uploaded the new config but forgot to reload the source collection so it was using 1000ms as the schedule which explains the slowness.
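
          For reference, after uploading a changed configset the collection has to be reloaded for the new CDCR parameters to take effect; with the Collections API that is something like the following (host and collection name are placeholders):

          curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=cdcr_source"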

          In terms of moving from a batch model to a pure streaming one, this would probably simplify the configuration on the user side, but in terms of performance, I am not sure...

          Yeah, I now see that it probably won't affect performance much. But I would still prefer streaming because the batch size and schedule are really achieving the same thing, i.e. streaming. Furthermore, as you said, schedule and batchSize are two more things for the user to configure, whereas setting a transfer rate is much easier for the user.

          rendel Renaud Delbru added a comment -

          Yes, I think we should probably change the default value of the scheduler to 1ms unless we change the model to a streaming one. 1000ms is way too high as a default value.

          shalinmangar Shalin Shekhar Mangar added a comment -

          The tlog replication is only relevant to the source cluster, as it ensures that tlogs are replicated between a master and slaves in case of a recovery (with a snappull)

          Ah, I see, thanks for explaining. Am I correct in assuming that, since the current tlog is not in the logs deque, this does not interfere with the replaying of buffered updates?

          shalinmangar Shalin Shekhar Mangar added a comment -

          Any idea why this might happen? Looks like the state is null. This started happening after I reloaded the source collection and re-indexed the JSON documents.

          339784408 ERROR (cdcr-replicator-41-thread-155-processing-n:127.0.1.1:8001_solr x:cdcr_source_shard1_replica1 s:shard1 c:cdcr_source r:core_node1) [c:cdcr_source s:shard1 r:core_node1 x:cdcr_source_shard1_replica1] o.a.s.c.u.ExecutorUtil Uncaught exception java.lang.NullPointerException thrown by thread: cdcr-replicator-41-thread-155-processing-n:127.0.1.1:8001_solr x:cdcr_source_shard1_replica1 s:shard1 c:cdcr_source r:core_node1
          java.lang.Exception: Submitter stack trace
          	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:204)
          	at org.apache.solr.handler.CdcrReplicatorScheduler$1.run(CdcrReplicatorScheduler.java:80)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:745)
          Exception in thread "cdcr-replicator-41-thread-155" java.lang.NullPointerException
          	at java.util.concurrent.ConcurrentLinkedQueue.checkNotNull(ConcurrentLinkedQueue.java:914)
          	at java.util.concurrent.ConcurrentLinkedQueue.offer(ConcurrentLinkedQueue.java:327)
          	at org.apache.solr.handler.CdcrReplicatorScheduler$1$1.run(CdcrReplicatorScheduler.java:89)
          	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:745)
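          For reference, the NullPointerException at the bottom of the trace is ConcurrentLinkedQueue rejecting a null element on offer(). The tiny self-contained demo below reproduces only that failure mode (it is not the CDCR scheduler code), which is consistent with the replicator state being null after the reload.

          import java.util.concurrent.ConcurrentLinkedQueue;

          public class QueueNullOfferDemo {
              public static void main(String[] args) {
                  ConcurrentLinkedQueue<Object> statesQueue = new ConcurrentLinkedQueue<>();
                  Object state = null; // stand-in for a replicator state that was never re-created
                  try {
                      statesQueue.offer(state); // ConcurrentLinkedQueue does not accept null elements
                  } catch (NullPointerException e) {
                      System.out.println("offer(null) rejected: " + e);
                  }
              }
          }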
          
          rendel Renaud Delbru added a comment -

          That's a good point, and I think the current implementation might interfere with the replay of the buffered updates. The current tlog replication works as follows:
          1) Fetch the tlog files from the master
          2) Reset the update log before switching the tlog directory
          3) Switch the tlog directory and re-initialise the update log with the new directory.
          Currently there is no logic to keep buffered updates while resetting and re-initialising the update log, so it looks like the tlog replication still needs some work.
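          For illustration only, here is a self-contained sketch of that three-step flow using plain java.nio. The directory layout and swap strategy are invented for the demo and are not the CDCR/IndexFetcher code; the point is that nothing in the flow carries buffered updates across the reset.

          import java.io.IOException;
          import java.nio.file.Files;
          import java.nio.file.Path;

          public class TlogSwapSketch {
              public static void main(String[] args) throws IOException {
                  Path base = Files.createTempDirectory("tlog-swap-demo");
                  Path currentTlogDir = Files.createDirectories(base.resolve("tlog"));
                  // 1) Fetch the tlog files from the master (simulated here by a local directory).
                  Path fetchedTlogDir = Files.createDirectories(base.resolve("tlog.fetched"));
                  Files.createFile(fetchedTlogDir.resolve("tlog.0000000000000000001"));

                  // 2) "Reset" the update log: stop using the old directory. Any buffered
                  //    updates held in it are left behind at this point.
                  Path retired = base.resolve("tlog.old");
                  Files.move(currentTlogDir, retired);

                  // 3) Switch the tlog directory; the real flow would now re-initialise the update log.
                  Files.move(fetchedTlogDir, currentTlogDir);
                  System.out.println("update log would be re-initialised from " + currentTlogDir);
              }
          }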

          rendel Renaud Delbru added a comment -

          This is the first time I have seen this issue.
          How did you perform the reload? Did you delete the source collection before the reload, or just reload and overwrite the existing documents?

          shalinmangar Shalin Shekhar Mangar added a comment -

          I used the collection reload API and then added new documents. Since my JSON documents do not have an 'id' field and I am using a data-driven schema, there is no overwriting and the same docs are added again with a new unique key.

          shalinmangar Shalin Shekhar Mangar added a comment -

          In that case, this can easily lead to lost updates. We should add a test which does constant indexing and triggers a recovery in a replica and asserts that all replicas are consistent at steady state.

          erickerickson Erick Erickson added a comment -

          Shalin Shekhar Mangar Renaud and I have been trying to figure out what in the test framework seems to be giving us trouble getting the existing tests to pass. We (well, mostly Renaud) have reworked some of the tests but are still having problems. I have several changes on my local machine that help isolate the problems, but don't fix them. However, some recent changes have caused a 100% failure case, so I'm not going to commit anything. If you (or anyone else) want to play with the changes, I can attach a patch that applies to trunk.

          We're getting an NPE that wasn't there before and I won't have time until this weekend at best to look any more deeply.

          Let me know if you'd like to see the current patch; I've been waiting until I could get a better idea of what's wonky in the current tests before checking anything in.

          shalinmangar Shalin Shekhar Mangar added a comment -

          Hi Erick Erickson, please post the patch and I can take a look.

          erickerickson Erick Erickson added a comment -

          Shalin Shekhar Mangar Attached a patch for you that should apply cleanly to trunk. It rolls up all the intermediate changes we've made AND has some special logging in IndexFetcher to show which of the chained calls generates the null pointer exception; it's
          solrcore.getUpdateHandler().getUpdateLog() that's generating the exception.

          Just look for the initials EOE, around line 290 or so. Obviously that shouldn't be committed.

          This patch, applied to trunk, should be used as the base for ongoing work. I've been meaning to commit it for a while but haven't gotten to the bottom of the test failures we were having before the null pointer issue cropped up. I'll be happy to coordinate that whenever.

          rendel Renaud Delbru added a comment - - edited

          Erick Erickson, please find attached your patch with some fixes.
          The cause of the NPE was that some replication handler tests were not running in cloud mode, and therefore the update log was null. I have added a simple fix for that issue. I have also fixed some merge issues with the latest trunk. The full Solr test suite was executed successfully.
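          A minimal, self-contained sketch of the kind of guard described above: when a core is not running in cloud mode its update log is null, so the tlog replication step is skipped instead of dereferencing it. The class and method names are stand-ins, not the committed fix.

          public class UpdateLogGuardSketch {
              // Stand-in for whatever solrcore.getUpdateHandler().getUpdateLog() returns.
              static final class UpdateLog { }

              static void replicateTlogs(UpdateLog ulog) {
                  if (ulog == null) {
                      System.out.println("no update log (non-cloud core): skipping tlog replication");
                      return;
                  }
                  System.out.println("would replicate tlog files here");
              }

              public static void main(String[] args) {
                  replicateTlogs(null);             // standalone replication-handler test case
                  replicateTlogs(new UpdateLog());  // cloud-mode core
              }
          }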

          Shalin Shekhar Mangar Regarding the potential issue with the transaction log replication, I will have a look this week. Should I open a sub-issue to track this separately?

          erickerickson Erick Erickson added a comment -

          OK, this patch fixes up a number of test issues. There are still some zombie thread leaks. I tried extending the ThreadLeakLingering annotation just for a quick test, but that didn't seem to cure the zombie problem.
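          For anyone unfamiliar with the annotation mentioned above, the snippet below shows the general shape of such an experiment: give lingering threads a few extra seconds before the leak detector fails the suite. The test class name and linger value are placeholders, not the actual change tried here, and it assumes the randomizedtesting framework is on the classpath.

          import com.carrotsearch.randomizedtesting.RandomizedRunner;
          import com.carrotsearch.randomizedtesting.annotations.ThreadLeakLingering;
          import org.junit.Test;
          import org.junit.runner.RunWith;

          @RunWith(RandomizedRunner.class)
          @ThreadLeakLingering(linger = 5000) // wait up to 5 seconds for lingering threads
          public class SomeCdcrZombieThreadTest {
              @Test
              public void smoke() {
                  // test body omitted; the point is the class-level annotation
              }
          }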

          Apart from the zombie issue, I haven't seen test failures for about 300 tests of CdcrReplicationDistributedZkTest, and both precommit and test succeed. I'll be beasting the other CDCR tests over the weekend, but as they take quite some time to run it'll be a slow process.

          I'm going to commit this as it's the current state of the art and we should base any additional changes on this code line...

          jira-bot ASF subversion and git services added a comment -

          Commit 1713022 from Erick Erickson in branch 'dev/trunk'
          [ https://svn.apache.org/r1713022 ]

          SOLR-6273: testfix7, improves test pass ratio significantly

          jira-bot ASF subversion and git services added a comment -

          Commit 1713207 from Erick Erickson in branch 'dev/trunk'
          [ https://svn.apache.org/r1713207 ]

          SOLR-6273: Took out inadvertent copyright comment

          jira-bot ASF subversion and git services added a comment -

          Commit 1714099 from Erick Erickson in branch 'dev/trunk'
          [ https://svn.apache.org/r1714099 ]

          SOLR-6273: Removed unused imports, no code changes

          erickerickson Erick Erickson added a comment -

          This patch rolls up all of the changes from trunk for application on the 5x code line for this JIRA. I'm adding it for a couple reasons:

          1> If we decide to fold this into 5.5, we should just be able to apply this and go rather than reconstruct all the commits.

          2> Applying this along with the patch for SOLR-8263 will be simple for anyone who wants this functionality on the Solr 5x code line. I recommend that this be applied against the 5.4.x line as that's closest to the code base used to generate this patch.

          WARNINGS!
          1> THIS IS SUPPLIED "AS IS". I've applied it to 5x very close to the time Solr 5.4 was cut. I've run over 100 runs of all the CDCR test suites, precommit and test on it. All this works just like trunk where we haven't seen Jenkins errors for quite a while. All that said, this is not officially supported on 5x and may never be, depending on when 6.0 is released.

          2> SOLR-8263 addresses a potential data loss issue and should be applied after this patch. There'll be a 5x version of SOLR-8263 soon. Like this JIRA, the code for SOLR-8263 will be applied to trunk but not 5x unless we decide to back-port this functionality to 5x.

          erickerickson Erick Erickson added a comment -

          Closing this (finally!). SOLR-8263 still needs to be fixed in order for us to tie a bow around CDCR; that's a separate issue.

          This is NOT being back-ported to 5.x at this point. I've provided a 5x rollup patch in case we want to do that, but we'll decide that later. I'll create a blocker on 5.5 just as a marker for consciously resolving this question if there's going to be a 5.5 release.

          erickerickson Erick Erickson added a comment -

          Need to attach combined 6273 and 8263 patch here too

          erickerickson Erick Erickson added a comment -

          This patch is for 5x in case we ever want to put CDCR in a 5x release, since both SOLR-6273 and SOLR-8263 should be committed. I'll put this patch on both JIRAs. The patch should just be applied to 5x; no merging from trunk is necessary there.

          NOTE: The 5x patch was a little tricky to generate as disallowing local loggers happened in the meantime, but all that is incorporated here.

          Many kudos to Renaud for all this work

          erickerickson Erick Erickson added a comment - - edited

          Attached 5x rollup patch for 6273 and 8263

          BTW, I've beasted the 4 CDCR test suites over 100 times each with this rollup patch against 5x, so I'm pretty confident it faithfully reflects the trunk code.

          dpgove Dennis Gove added a comment - - edited

          Updated patch for 5x (specifically v5.5) which includes the changes in SOLR-8263. A small number of changes related to variable visibility have been made to the original patch.

          Also, this patch was created with git whereas the original one appears to have been created with svn. I believe this is the cause of the file size difference (the new one is smaller).

          griffy Michael Griffith added a comment -

          Is this CDCR compatible with 4.10.3 – is it already baked into the code base prior to 4.10.3? Alfresco 5.1 is now out and it uses 4.10.3 as its Solr base, but the Alfresco project heavily modifies the code. I'm trying to figure out if this is something that can be used in our Alfresco data centers without having to patch or change any code.

          thanks in advance,

          erickerickson Erick Erickson added a comment -

          First, it's better to raise this kind of question on the users' list; a closed JIRA will only get eyeballs on it by chance.

          Gah, there was a kerfuffle with the labels for JIRAs and this one is labeled "master", which isn't very helpful. To answer though, this is only current on 6.0+. There is a 5x patch that I put up "just in case" that's never been applied to the 5x code line. 4x is not going to happen.

          joel.bernstein Joel Bernstein added a comment -

          Alfresco's Solr implementation doesn't use SolrCloud, so CDCR is not going to work with Alfresco Solr.

          Alfresco's Solr implementation is eventually consistent though and should work across data centers; no CDCR is needed.


            People

            • Assignee:
              erickerickson Erick Erickson
            • Reporter:
              yseeley@gmail.com Yonik Seeley
            • Votes:
              16
            • Watchers:
              54
