Solr
SOLR-5656

Add autoAddReplicas feature for shared file systems.

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      When using HDFS, the Overseer should have the ability to reassign the cores from failed nodes to running nodes.

      Given that the index and transaction logs are in hdfs, it's simple for surviving hardware to take over serving cores for failed hardware.

      There are some tricky issues around having the Overseer handle this for you, but seems a simple first pass is not too difficult.

      This will add another alternative to replicating both with hdfs and solr.

      It shouldn't be specific to hdfs, and would be an option for any shared file system Solr supports.

      https://reviews.apache.org/r/23371/

      Attachments

      1. SOLR-5656.patch
        161 kB
        Mark Miller
      2. SOLR-5656.patch
        164 kB
        Mark Miller
      3. SOLR-5656.patch
        161 kB
        Mark Miller
      4. SOLR-5656.patch
        174 kB
        Mark Miller


          Activity

          Mark Miller added a comment -

          I've got a pretty nicely working patch for auto replica failover ready for this issue. It will need some small extensions to work without a shared filesystem, but it provides all of the ground work that we will want for that. I'll post my current progress in a couple days.

          Shalin Shekhar Mangar added a comment -

          It will need some small extensions to work without a shared filesystem, but it provides all of the ground work that we will want for that. I'll post my current progress in a couple days.

          That's awesome, looking forward to it!

          Mark Miller added a comment -

          Sorry, took me a while to get this patch up.

          Here is a first patch for feedback. It's a git patch against trunk from a couple days ago.

          I'll add a new patch that's converted to svn trunk shortly.

          I'll also comment shortly with more details on the patch.

          Mark Miller added a comment -

          Here is an svn patch against trunk.

          Mark Miller added a comment -

          The approach is fairly simple.

          The Overseer class gets a new thread that periodically evaluates live nodes and cluster state and fires off SolrCore create commands to add replicas when there are not enough replicas up to meet a collection's replicationFactor.

          The feature is enabled per collection by an additional boolean Collections API create param called autoAddReplicas.

          This feature only works with the Collections API.
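
          For illustration only, a Collections API create call that turns the feature on might look like this minimal Java sketch; the host, collection name, and shard/replica counts are hypothetical, while autoAddReplicas is the boolean param described above:

          import java.io.InputStream;
          import java.net.HttpURLConnection;
          import java.net.URL;

          public class CreateCollectionWithAutoAddReplicas {
            public static void main(String[] args) throws Exception {
              // Hypothetical host, collection name and shard/replica counts;
              // autoAddReplicas=true is the boolean param described above.
              String url = "http://localhost:8983/solr/admin/collections"
                  + "?action=CREATE&name=collection1&numShards=2&replicationFactor=2"
                  + "&autoAddReplicas=true";
              HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
              System.out.println("HTTP " + conn.getResponseCode());
              try (InputStream in = conn.getInputStream()) {
                in.transferTo(System.out); // print the Collections API response
              }
            }
          }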

          In this initial implementation, replicas are not removed if you end up with too many for some reason, and replicas are not rebalanced when nodes come back to life. You must manually move replicas after restoring a node to rebalance.

          There are three settings exposed:

          autoReplicaFailoverWorkLoopDelay: How often the Overseer inspects the clusterstate and possibly takes action.

          autoReplicaFailoverWaitAfterExpiration: Once a replica no longer looks live, it won't be replaced until at least this much time has passed after noticing that.

          autoReplicaFailoverBadNodeExpiration: Once a replica is marked as looking like it needs to be replaced, if it still looks bad on a future cycle, it will be replaced. Once a node is marked as looking bad, after this much time it will be unmarked.
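
          Purely as an illustration of how the work loop and these settings fit together, here is a rough sketch; the ClusterView/CollectionView/SliceView interfaces are hypothetical stand-ins for the real cluster-state plumbing, not the actual Overseer code, and the bad-node marking/expiration bookkeeping is omitted for brevity:

          import java.util.List;

          public class AutoReplicaFailoverLoopSketch implements Runnable {

            interface SliceView {
              int liveReplicaCount();
              long downSinceMs();              // when a replica was first noticed not live
              void createReplacementReplica(); // stand-in for firing a SolrCore create command
            }

            interface CollectionView {
              int replicationFactor();
              List<SliceView> slices();
            }

            interface ClusterView {
              List<CollectionView> collectionsWithAutoAddReplicas(); // autoAddReplicas=true only
            }

            private final ClusterView cluster;
            private final long workLoopDelayMs;       // autoReplicaFailoverWorkLoopDelay
            private final long waitAfterExpirationMs; // autoReplicaFailoverWaitAfterExpiration

            AutoReplicaFailoverLoopSketch(ClusterView cluster, long workLoopDelayMs,
                long waitAfterExpirationMs) {
              this.cluster = cluster;
              this.workLoopDelayMs = workLoopDelayMs;
              this.waitAfterExpirationMs = waitAfterExpirationMs;
            }

            @Override
            public void run() {
              try {
                while (!Thread.currentThread().isInterrupted()) {
                  for (CollectionView coll : cluster.collectionsWithAutoAddReplicas()) {
                    for (SliceView slice : coll.slices()) {
                      boolean underReplicated =
                          slice.liveReplicaCount() < coll.replicationFactor();
                      boolean waitedLongEnough =
                          System.currentTimeMillis() - slice.downSinceMs() > waitAfterExpirationMs;
                      if (underReplicated && waitedLongEnough) {
                        slice.createReplacementReplica();
                      }
                    }
                  }
                  Thread.sleep(workLoopDelayMs); // periodic inspection of the cluster state
                }
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
              }
            }
          }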

          Additional automated testing needs to be added, as initially I have focused on manual testing. To aid in that, I have improved the cloud-dev scripts to make this type of feature much easier to test. I have one more patch to put up that expands on that a bit by starting another Solr node that can run ZooKeeper external to the cluster and that can be used to view the Solr Admin Cloud tab without actually participating in the cluster. It just makes monitoring while testing easier and removes the need to run ZooKeeper yourself or internally on one of the cluster nodes.

          Mark Miller added a comment -

          Here is another patch with a couple additions:

          • A couple fixes that 'ant precommit' brought up.
          • Finished the improvements to the /cloud-scripts I was working on.
          Gregory Chanan added a comment -

          An aside: we should probably set up review board for this project. No need to make it necessary, but it might make it easier to review and thus improve the number of reviews.

          HdfsChaosMonkeySafeLeaderTest.java:

          Line 28: import org.junit.Ignore;
          

          doesn't seem to be needed

          ClusterState.java:

          Line 218: public String getShardIdByCoreNodeName(String collectionName, String coreNodeName) { 
          

          This is a general comment, but there are a bunch of strings like coreNodeName, replica names, baseUrls whose relations I'm not sure about (I usually end up logging them and doing some ad-hoc comparisons). Perhaps turning these into types with proper conversions would make things clearer? Not necessary for this patch, just wondering your thoughts.

          ClusterStateUtil.java:

          Line 40: *          how to wait before giving up

          how long to wait (this appears multiple times)

          Line 113: for (Replica replica : replicas) {
          

          can we reduce the amount of duplicate code in these functions?

          Line 200: int timeOutInSeconds) {
          

          Seconds seem limiting, why not milliseconds?

          Line 242: isAddAddReplicas(...)
          

          move ZkStateReader to first param to match other functions here

          Line 247: boolean autoAddRepliacs = docCollection.getAutoAddReplicas();
          

          can just return autoAddReplicas here (also, Repliacs is misspelled)

          ZkStateReader.java:

          Line 369: for (LiveNodesListener listener : liveNodesListeners) {
          

          Do we need this? No one seems to be using it. It's a little hard to figure out the model as well, i.e. it's not called when you explicitly call updateClusterState. I think the real issue is ZkStateReader is too complicated, but I haven't thought about how to address that.

          Line 284: controlJetty = createJetty(controlJettyDir, useJettyDataDir ? getDataDir(testDir
          

          Why can't we call createControlJetty?

          MockZkStateReader.java
          There was some comment that this is only necessary temporarily. Is that already addressed? Is there a JIRA?

          Mark Miller added a comment -

          An aside: we should probably set up review board for this project. No need to make it necessary, but it might make it easier to review and thus improve the number of reviews.

          I filed an issue a while back: https://issues.apache.org/jira/browse/INFRA-7630

          there are a bunch of strings like coreNodeName, replica names, baseUrls whose relations I'm not sure about (I usually end up logging them and doing some ad-hoc comparisons). Perhaps turning these into types with proper conversions would make things clearer? Not necessary for this patch, just wondering your thoughts.

          Sounds interesting to me - I agree, let's do it in another issue though.

          can we reduce the amount of duplicate code in these functions?

          I'll look at it - I made an attempt at one point to converge some code, but some small annoyance had me just revert it.

          Seconds seem limiting, why not milliseconds?

          I think given the speed at which state updates propagate through the system, seconds is all the granularity that is needed, but I don't mind making it milliseconds.

          Do we need this?

          No, the listeners were for a failed path. I had thought I had taken them out in git, but probably used SmartGit to edit it in the index and not the working tree and lost the changes. Happens sometimes when I decide to edit with SmartGit.

          I think the real issue is ZkStateReader is too complicated

          Yeah, it kind of became a catch all class for anything above SolrZkClient and below ZkController. I'd just file a JIRA issue to at least start collecting ideas on some refactoring. ZkController could probably also use some editing.

          Why can't we call createControlJetty?

          I'll have to look. At one point, because it needed to be different, but I don't remember offhand if that is still the case.

          There was some comment that this is only necessary temporarily. Is that already addressed?

          That's https://issues.apache.org/jira/browse/SOLR-5473. I merged this up to that the first time it was committed. Once that was temporarily reverted and I updated to trunk, I commented that bit out. I'll leave it in until that issue is resolved without putting a ZkStateReader in ClusterState.

          Mark Miller added a comment -

          I'll put up another patch shortly.

          Gregory Chanan added a comment -

          DocCollection.java

          Line 154: if (maxShardsPerNode == null) {
          

          Why is this being checked? https://cwiki.apache.org/confluence/display/solr/Collections+API says replication factor is required, but not maxShards?

          SharedFSAutoReplicaFailoverTest.java

          Line 133: assertTrue("Timeout waiting for all live and active", ClusterStateUtil.waitForAllActiveAndLive(cloudClient.getZkStateReader(), collection1, 120));
          

          How come you only check collection1 throughout this test?

          Line 135: assertSliceAndReplicaCount(collection1);
          

          What about specifically targeting the node with the Overseer?

          Line 181: assertEquals(2, slices.size());
          

          Lots of magic numbers here

          SharedFSAutoReplicaFailoverUtilsTest.java

          Line 204: * c = collection, s = slice, r = replica, r\d = node\d, -\d = state (1=active,2=recovering,3=down,4=recovery_failed), * = bad replica 
          

          I can't figure anything out from this description. Maybe examples would help? I doubt you are actually saving much space compared to some simple builder.

          Gregory Chanan added a comment -

          autoReplicaFailoverBadNodeExpiration: This name is a bit confusing – from just the name I can't figure out whether this is the time until a node that has been marked bad is retried, or the time until we stop trying once we detect a node is bad. Maybe something like autoReplicaFailoverBadNodeTimeUntilRetry?

          Overseer.java:

          Line 356: System.err.println("Process msg " + message);
          

          You meant to leave this in?

          660:  //if (!checkCollectionKeyExistence(message)) return clusterState;
          

          why is this commented out?

          OverseerAuthReplicaFailoverThread.java:

          Line 82: private static Integer lastClusterStateVersion;
          

          should be volatile if static? Why static?

          Line 293:  static String getBestCreateUrl(ZkStateReader zkStateReader, DownReplica badReplica) {
          

          The API is a bit confusing, b/c this is the only function that takes a ZkStateReader – I think this is just b/c you want to test this function. Can the test just create one of these objects but not start it to simplify the API?

          ConfigSolr.java:

          Line 295: SOLR_AUTOREPLICAFAILOVER, 

          Is this meant to be here? There's no accessor? I think it's only controlled based on what's in the request?

          ConfigSolrXml.java:

          Line 120: propMap.put(CfgProp.SOLR_AUTOREPLICAFAILOVER, doSub("solr/solrcloud/bool[@name='genericCoreNodeNames']"));
          

          This looks wrong – should be:

          solr/solrcloud/bool[@name='autoReplicaFailover']

          ConfigSolrXmlOld.java:

          Line 168: config.getVal("solr/cores/@autoReplicaFailoverWaitAfterExperation", false));
          

          Expiration

          log4j.properties:

          Line 29: log4j.logger.org.apache.solr.common.cloud.ClusterStateUtil=DEBUG
          

          we want this on for every test?

          Mark Miller added a comment -

          I doubt you are actually saving much space compared to some simple builder.

          It's not a space-saving technique - I find it much faster to construct a clusterstate by typing a string of letters (especially a large one) vs dealing with any Java code.

          Mark Miller added a comment -

          This patch addresses most comments.

          Other responses coming tomorrow.

          Mark Miller added a comment -

          Line 295: SOLR_AUTOREPLICAFAILOVER,

          Is this meant to be here?

          No, from an earlier path. Removed.

          log4j.properties:

          No, I'll remove it before committing, but find it convenient to keep with the patch until then.

          Mark Miller added a comment - edited

          ReviewBoard has been set up and this seems like a good issue to trial it with. I created a review request: https://reviews.apache.org/r/23371/

          I also sent an email to the dev list so that we can have discussion around ReviewBoard and the Lucene project.

          David Smiley added a comment -

          It seems this may be the case but I just want to confirm it: will this issue obviate the pointless replication (duplication) of data on a shared file system between replicas?

          Gregory Chanan added a comment -

          ReviewBoard has been setup and this seems like a good issue to trial it with. I created a review request: https://reviews.apache.org/r/23371/

          Great!

          I doubt you are actually saving much space compared to some simple builder.

          Sorry, I meant screen space.

          Mark Miller added a comment -

          The format is driven by my laziness. If I want to test how a specific clusterstate will choose to fail over, I want to make it super simple to set up such a clusterstate so that we can test a large variety of them with minimal effort. It is a work in progress overall though - hacked together while traveling to California a month or two ago.

          So, to set up a clusterstate:

          csr*r2sr3csr2

          That creates 2 collections. Collection1 has 2 shards. In its first shard, its first replica is on node 1 and down. Its second replica is on node 2. In its second shard, a replica is on node 3. Collection2 has a single replica on its first shard.

          I left in the printout when you run tests in SharedFSAutoReplicaFailoverUtilsTest, so it's easy to check the full printout of the clusterstate created.
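
          As a reading aid for the format above, a tiny standalone parser over that grammar might look like the following sketch; the CollectionSpec/SliceSpec/ReplicaSpec holder classes are hypothetical and this is not the code in SharedFSAutoReplicaFailoverUtilsTest:

          import java.util.ArrayList;
          import java.util.List;

          // Hypothetical holder classes used only for this sketch; the real test
          // builds an actual ClusterState rather than these simple specs.
          class ReplicaSpec {
            int node = 1;        // no digit after 'r' currently defaults to node 1
            int state = 1;       // 1=active, 2=recovering, 3=down, 4=recovery_failed
            boolean bad = false; // '*' marks the replica to be replaced
          }

          class SliceSpec { final List<ReplicaSpec> replicas = new ArrayList<>(); }

          class CollectionSpec { final List<SliceSpec> slices = new ArrayList<>(); }

          public class ClusterStringSketch {
            static List<CollectionSpec> parse(String s) {
              List<CollectionSpec> collections = new ArrayList<>();
              CollectionSpec collection = null;
              SliceSpec slice = null;
              ReplicaSpec replica = null;
              for (int i = 0; i < s.length(); i++) {
                char ch = s.charAt(i);
                switch (ch) {
                  case 'c': collection = new CollectionSpec(); collections.add(collection); break;
                  case 's': slice = new SliceSpec(); collection.slices.add(slice); break;
                  case 'r': replica = new ReplicaSpec(); slice.replicas.add(replica); break;
                  case '*': replica.bad = true; break;                  // binds to the preceding replica
                  case '-': replica.state = s.charAt(++i) - '0'; break; // state override, e.g. -3 = down
                  default:  replica.node = ch - '0'; break;             // bare digit = node number
                }
              }
              return collections;
            }

            public static void main(String[] args) {
              // csr*r2sr3csr2: collection1 has two slices; its first slice has a bad
              // replica on node 1 and a replica on node 2, its second a replica on
              // node 3; collection2 has one slice with a replica on node 2.
              System.out.println(parse("csr*r2sr3csr2").size() + " collections parsed");
            }
          }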

          Gregory Chanan added a comment -

          So, I should assume if there is no node number it's on node 1? And that * attaches to the previous shard? I still don't really know what "-" does.

          Mark Miller added a comment -

          So, I should assume if there is no node number it's on node 1?

          Currently it defaults to 1. I was going to make it explicit, but there is not a lot of error checking yet anyway, so I left it for further improvement later. I figure this will be reused in a few other places that have to choose nodes given a clusterstate.

          And that * attaches to the previous shard?

          The * marks a replica as the one being replaced. The current replacement algorithm looks at each replica - when it finds one, it looks for the best place to replace it given a clusterstate.

          I still don't really know what "-" does.

          It just overrides a specific state for a replica in clusterstate.json - so rather than ACTIVE, you could mark them as RECOVERING or DOWN.

          Gregory Chanan added a comment -

          I figure this will be reused in a few other places that have to choose nodes given a clusterstate.

          I definitely think this sort of thing is useful; I just find it difficult to parse currently.

          The * marks a replica as the one being replaced. The current replacement algorithm looks at each replica - when it finds one, it looks for the best place to replace it given a clusterstate.

          What I mean here is, for example, csr1-2*r2, does the * bind to (r1-2) or (r2)?

          Currently it defaults to 1. I was going to make it explicit, but not a lot of error checking yet anyway, so left it for further improvement later. I figure this will be reused in a few other places that have to choose nodes given a clusterstate.

          I think making it explicit is a good idea. If being explicit isn't required, is csr-2 legal?

          It just overrides a specific state for a replica in clusterstate.json - so rather than ACTIVE, you could mark them as RECOVERING or DOWN.

          I wonder if it would be clearer if these states were non-intersecting with the node specification. So like, A=active, D=down, F=failed, R=recovering or if you are worried about case, could make recovering C or M (for moving, there isn't really a difference from a replica being moved vs recovering logically I think)? Then you wouldn't need the minus either. A default of "A" there seems reasonable too. What do you think?

          Gregory Chanan added a comment -

          I put up a new review on reviewboard.

          Mark Miller added a comment -

          It seems this may be the case but I just want to confirm it: will this issue obviate the pointless replication (duplication) of data on a shared file system between replicas?

          This is just another option. It works both with and without replicas for a shard. There are trade-offs in failover transparency, time, and query throughput depending on what you choose.

          Another option I'm about to start pursuing is SOLR-6237: An option to have only leaders write and replicas read when using a shared file system with SolrCloud.

          I don't yet fully know what trade offs may come up in that.

          Mark Miller added a comment -

          What I mean here is, for example, csr1-2*r2, does the * bind to (r1-2) or (r2)?

          It binds to r1-2. It goes csr and everything binds to the right.

          I've got a patch that integrates some of this discussion.

          Mark Miller added a comment -

          If being explicit isn't required, is csr-2 legal?

          Yeah, it's legal.

          Mark Miller added a comment -

          Didn't ping the issue itself, but I put up a new patch about a week and a half ago: https://reviews.apache.org/r/23371/

          Scott Lindner added a comment -

          Does this depend on SOLR-6237? Your comments above seem to imply that it's not, but conceptually can't you end up with multiple replicas pointing to the same location on disk (admittedly I haven't dug into your code)? If so, what impact does this have with multiple solr instances (i.e. multiple replicas of a given shard) pointing to the same location on disk where each could be accepting write changes?

          Mark Miller added a comment -

          Does this depend on SOLR-6237?

          No, that is a separate feature.

          but conceptually can't you end up with multiple replicas pointing to the same location on disk

          There are efforts to prevent this; I don't think it's likely, but we will improve this over time. There are settings that can be tuned around timing as well. I have done a fair amount of manual testing already, but the unit tests will be expanded over time.

          If so, what impact does this have with multiple solr instances (i.e. multiple replicas of a given shard) pointing to the same location on disk where each could be accepting write changes?

          You should be running with the hdfs lock impl so that one of the SolrCores would fail to start.

          I plan on committing this soon so that I don't have to keep it up to date with trunk. We can still iterate if there are further comments. The feature is optional per collection and defaults to off.

          ASF subversion and git services added a comment -

          Commit 1617919 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1617919 ]

          SOLR-5656: Add autoAddReplicas feature for shared file systems.

          ASF subversion and git services added a comment -

          Commit 1618655 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1618655 ]

          SOLR-5656: Add autoAddReplicas feature for shared file systems.

          David Smiley added a comment -

          How is it that Solr figures out which HDFS data nodes have the same replicated data and thus would make prime candidates to add a shard replica? I looked at some of the patch, but it wasn't apparent to me how it determines this.


            People

            • Assignee: Mark Miller
            • Reporter: Mark Miller
            • Votes: 2
            • Watchers: 12
