Solr / SOLR-3397

Insure that Replication and Solr Cloud are compatible

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.6, Trunk
    • Labels:
      None

      Description

      There has been at least one report of an early adopter who configured replication (as in master/slave) alongside SolrCloud and got very odd results. Experienced Solr users could reasonably try this (or simply have configurations from 3.x Solr installations hanging around). Since SolrCloud takes this functionality over completely, it seems like replication needs to be made smart enough to disable itself when running under SolrCloud.

      Attachments:
      1. SOLR-3397.patch (1 kB, Erick Erickson)

        Activity

        ASF subversion and git services added a comment -

        Commit 1540930 from Erick Erickson in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1540930 ]

        SOLR-3397: Insure that replication and SolrCloud are compatible. Actually, just log a warning if SolrCloud is detected and master or slave is configured in solrconfig.xml

        Erick Erickson added a comment -

        Patch that logs a warning if master or slave is configured and a zkController is detected.
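        The check described in the patch can be sketched in plain Java. This is not the code from SOLR-3397.patch: the handler's configuration is modeled as a simple Map (a hypothetical stand-in for its init arguments), and a boolean zkEnabled stands in for "a zkController is detected". The class and method names are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.logging.Logger;

class ReplicationConfigCheck {
    private static final Logger LOG = Logger.getLogger(ReplicationConfigCheck.class.getName());

    /**
     * Returns a warning if a "master" or "slave" section is configured while running
     * under SolrCloud; otherwise returns empty.
     *
     * @param handlerConfig hypothetical stand-in for the replication handler's init args
     * @param zkEnabled     stand-in for "a ZkController was detected"
     */
    static Optional<String> checkConfig(Map<String, ?> handlerConfig, boolean zkEnabled) {
        if (!zkEnabled) {
            // Classic (non-cloud) mode: master/slave configuration is legitimate here.
            return Optional.empty();
        }
        boolean hasMaster = handlerConfig.containsKey("master");
        boolean hasSlave = handlerConfig.containsKey("slave");
        if (!hasMaster && !hasSlave) {
            return Optional.empty();
        }
        String sections = hasMaster && hasSlave ? "master and slave sections"
                : hasMaster ? "a master section" : "a slave section";
        return Optional.of("SolrCloud is enabled but the replication handler has " + sections
                + " configured; classic master/slave settings are not meant to be used under SolrCloud");
    }

    public static void main(String[] args) {
        Map<String, Object> config = new LinkedHashMap<>();
        config.put("slave", Map.of("masterUrl", "http://localhost:8983/solr/core1/replication",
                "pollInterval", "00:00:60"));
        // Under SolrCloud this logs a warning; it does not refuse to start.
        checkConfig(config, true).ifPresent(msg -> LOG.warning(msg));
        // Outside SolrCloud nothing is logged.
        checkConfig(config, false).ifPresent(msg -> LOG.warning(msg));
    }
}
```

        Logging rather than failing matches the resolution recorded in the commit messages: a warning leaves room for the expert setups discussed below.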

        ASF subversion and git services added a comment -

        Commit 1540881 from Erick Erickson in branch 'dev/trunk'
        [ https://svn.apache.org/r1540881 ]

        SOLR-3397: Insure that replication and SolrCloud are compatible. Actually, just log a warning if SolrCloud is detected and master or slave is configured in solrconfig.xml

        Erick Erickson added a comment -

        Comments from dev list discussion:

        Erick:

        Let's say a configuration is running SolrCloud and has <lst name="master"> or <lst name="slave"> bits defined in the replication handler. Is it valid? Taken care of? Is it worth a JIRA to barf if we detect that condition?

        Because it strikes me as something that's at worst undefined behavior, at best ignored, and somewhere in the middle does replications as well as peer syncs as well as distributed updates.

        Under any circumstances it doesn't seem like the user is doing the right thing.

        Shawn:

        Initial thought: Yes, detect and explode.

        Second thought: Allowing replication config for the expert user (possibly for backup purposes) might be useful.

        Third thought: Yes, detect and explode. If someone wanted to write an application that used the handler as a direct API rather than through solrconfig.xml configuration, that would work with no problem. SolrCloud basically requires that the /replication handler be enabled, but not configured.

        Is the replication API fully documented anywhere? It might be nice to provide a skeletal example Java application that talks to the replication API for simple index backup purposes. It would be particularly nice if it used CloudSolrServer (or the ZK client classes) and showed how to back up and restore multiple shards. If I had any idea how to write such an application, I would have already gotten started on it.
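        A skeletal starting point might look like the sketch below. It uses only the handler's HTTP form, /replication?command=backup, with one request per shard. Assumptions: the core URLs are hypothetical and hard-coded (a real tool would read them from the cluster state via CloudSolrServer or the ZK client classes, which this sketch does not do), restore is not covered, and nothing is sent unless --send is passed.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

class ShardBackupSketch {

    /** Builds one "command=backup" request URI per core base URL. */
    static List<URI> backupUris(List<String> coreBaseUrls) {
        List<URI> uris = new ArrayList<>();
        for (String base : coreBaseUrls) {
            // Tolerate a trailing slash on the configured base URL.
            String trimmed = base.endsWith("/") ? base.substring(0, base.length() - 1) : base;
            uris.add(URI.create(trimmed + "/replication?command=backup"));
        }
        return uris;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical core URLs: one replica per shard, hard-coded for illustration.
        List<String> cores = List.of(
                "http://solr1:8983/solr/collection1_shard1_replica1",
                "http://solr2:8983/solr/collection1_shard2_replica1/");
        List<URI> uris = backupUris(cores);
        uris.forEach(System.out::println);

        // Only contact the servers when explicitly asked to.
        if (args.length > 0 && args[0].equals("--send")) {
            HttpClient client = HttpClient.newHttpClient();
            for (URI uri : uris) {
                HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(uri + " -> HTTP " + response.statusCode());
            }
        }
    }
}
```

        Backing up one replica per shard is enough to capture the whole collection, since every replica of a shard should hold the same documents.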

        Otis Gospodnetic added a comment -

        I would look at HBase replication. HBase also relies on ZK, yet people use HBase replication across DCs. Not bidirectional, I believe, though (http://hbase.apache.org/replication.html). Mark Miller's colleague, JD, is the main person behind HBase replication, so Mark can probably poke him for some tips.

        Shawn Heisey added a comment -

        The primary issue with a 'two-DC' solution is ZooKeeper. A 'three-DC' solution with one ZooKeeper server at each DC would work great, except for possible latency problems.
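        The two-DC problem is quorum arithmetic: a ZooKeeper ensemble of n voting servers stays available only while a strict majority, floor(n/2) + 1, can talk to each other. That majority rule is what prevents split-brain, and it is also why no split across two DCs survives the loss of either one. The sketch below checks a few layouts; the per-DC server counts are illustrative, and latency is not modeled.

```java
import java.util.Arrays;

class ZkQuorumSketch {

    /** Majority needed by an ensemble of n voting servers: floor(n/2) + 1. */
    static int quorum(int n) {
        return n / 2 + 1;
    }

    /** True if the ensemble keeps a majority after losing any single data center. */
    static boolean survivesAnySingleDcLoss(int... serversPerDc) {
        int total = Arrays.stream(serversPerDc).sum();
        for (int lost : serversPerDc) {
            if (total - lost < quorum(total)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Losing the DC with 2 servers leaves 1 of 3, below the quorum of 2.
        System.out.println("2 DCs, 2+1:   " + survivesAnySingleDcLoss(2, 1));    // false
        // An even split leaves 3 of 6, below the quorum of 4.
        System.out.println("2 DCs, 3+3:   " + survivesAnySingleDcLoss(3, 3));    // false
        // Any single DC loss leaves 2 of 3, which meets the quorum of 2.
        System.out.println("3 DCs, 1+1+1: " + survivesAnySingleDcLoss(1, 1, 1)); // true
    }
}
```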

        Shawn Heisey added a comment -

        I was planning to file a feature request for something that the Gluster project calls geo-replication. Would that fall under this issue, or should it be a new one?

        With this feature, you would have two semi-independent SolrCloud setups, one of which would replicate from the other. Ideally, there would be no master or slave - either the replication would work in both directions, or a higher-level leader election would take place.

        The driving force behind this feature is a user who wants a SolrCloud setup that is fully redundant between two data centers and can remain operational if an entire data center goes down. I'm not aware of a way to build a ZooKeeper ensemble in two locations that can always guarantee a working cluster without split-brain. Also, each update must be sent to all shards/replicas at the same time, which can be problematic when half of them are on a high-latency connection. If synchronizing the two clouds happens on a longer interval than the indexing, latency is less of a problem.

        I'm sure there are a ton of technical challenges to this idea. Perhaps this is a good candidate for GSoC?

        Erick Erickson added a comment -

        I phrased it poorly; I'm aware that SolrCloud uses replication as needed.

        My basic question is this: if a user has a classic master/slave setup and is running SolrCloud, are we sure the two play nicely together? Say a slave polls the master and new segments are moved to the slave (classic). Meanwhile, the master may or may not be the leader, and the slave may already have received the updates via the leader forwarding the requests. Does this all behave well?

        And if the leader goes down, a new leader is elected, and classic replication does what?

        It seems like replication handler polling should just be disabled in the SolrCloud world. Or is this all "just handled" today?

        We've seen problems in the past where people configure a classic master/slave setup and then merrily index to both machines, and replication gets all confused. I'm just making sure this has been handled, or at least flagged as something to check.

        And, yes, "they shouldn't do that". If we can put in a low-cost way to ensure this, it might save people grief. And people will no doubt be upgrading from 3.x at some point, moving their solrconfig files if nothing else...

        But as I said, I may be seeing something that's not there in which case we can close this as "silly boy is hallucinating again"...

        Mark Miller added a comment -

        Since SolrCloud takes this functionality over completely, it seems like replication needs to be made smart enough to disable itself if running under SolrCloud.

        SolrCloud does not take this functionality over; it uses it. The replication handler given in the example solrconfig.xml is sufficient. Some other config might make sense (say you want to enable basic auth or compression), while other settings should be left alone; e.g., you most likely don't want to configure a node as a slave.

        SolrCloud takes advantage of the existing replication functionality and passes parameters about who to poll from and when, so you can't disable it. We might just do some config checks and print warnings and/or fail depending on the config that is found, but in some cases there might be legitimate reasons to alter the config in ways we don't anticipate, depending on what you might be trying to set up.
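        The idea of passing "who to poll from" per request, rather than fixing it in solrconfig.xml, can be illustrated with the handler's HTTP command=fetchindex, which accepts a masterUrl parameter. This sketch only builds the request URIs; it is not a claim about how SolrCloud's recovery code invokes the handler internally, and the replica and leader URLs are hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class FetchFromLeaderSketch {

    /** Builds a one-off fetchindex request asking the replica to copy the index from the given leader. */
    static URI fetchIndexUri(String replicaCoreUrl, String leaderCoreUrl) {
        // masterUrl names the leader's replication handler and must be URL-encoded.
        String masterUrl = URLEncoder.encode(leaderCoreUrl + "/replication", StandardCharsets.UTF_8);
        return URI.create(replicaCoreUrl + "/replication?command=fetchindex&masterUrl=" + masterUrl);
    }

    public static void main(String[] args) {
        String replica = "http://solr3:8983/solr/collection1_shard1_replica2";
        // Hypothetical leaders before and after a leader election: the source changes per request,
        // which a static <lst name="slave"> masterUrl in solrconfig.xml cannot follow.
        System.out.println(fetchIndexUri(replica, "http://solr1:8983/solr/collection1_shard1_replica1"));
        System.out.println(fetchIndexUri(replica, "http://solr2:8983/solr/collection1_shard1_replica3"));
    }
}
```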


          People

          • Assignee: Erick Erickson
          • Reporter: Erick Erickson
          • Votes: 0
          • Watchers: 5
