Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0
    • Component/s: SolrCloud, update
    • Labels: None

      Description

The indexing side of SolrCloud - the goal of this issue is to provide durable, fault-tolerant indexing to an elastic cluster of Solr instances.

      1. 2shard4server.jpg
        67 kB
        Mark Miller
      2. apache-solr-noggit-r1211150.jar
        22 kB
        Mark Miller
      3. SOLR-2358.patch
        840 kB
        Mark Miller
      4. SOLR-2358.patch
        22 kB
        Alex Cowell
      5. zookeeper-3.3.4.jar
        1001 kB
        Mark Miller

        Issue Links

          Activity

          Uwe Schindler added a comment -

          Closed after release.

          Michael Garski added a comment -

          I have a use case for shard distribution based on something other than a hash of the document's unique id, and I was wondering if there are any thoughts on how such functionality should be implemented. It looks like SOLR-2341 (shard distribution policy) and SOLR-2592 (pluggable shard lookup mechanism) complement each other for indexing and searching; does anyone have thoughts on the approach to take?
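
          For illustration only, a pluggable lookup along the lines of SOLR-2592 might reduce to an interface like the sketch below - the names and signatures are hypothetical, not from either linked issue:

              import org.apache.solr.common.SolrInputDocument;

              // Hypothetical plug point; SOLR-2341/SOLR-2592 may define a different contract.
              public interface ShardDistributionPolicy {
                  /** Returns the index of the shard that should receive this document. */
                  int shardFor(SolrInputDocument doc, int numShards);
              }

              // The default behavior described above: hash the document's unique id.
              class HashUniqueIdPolicy implements ShardDistributionPolicy {
                  public int shardFor(SolrInputDocument doc, int numShards) {
                      String id = (String) doc.getFieldValue("id");
                      return Math.abs(id.hashCode() % numShards);
                  }
              }

          A custom implementation could route on any other field (a tenant id, a date range, etc.), as long as queries use the same lookup to find the right shards.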

          Yonik Seeley added a comment -

          For some things, like a request to recover, timeouts may be fine I think.

          Definitely - we have a much better handle on Solr-created requests. Replication (although it can take a long time to send a big file, there shouldn't be long periods where no packets are sent), PeerSync, etc.

          Although IIRC, a new cloud-style replication request involves the recipient doing a commit?

          Mark Miller added a comment -

          Yup, I agree - in general, in non-test code, we don't want to time out by default - that is why I've stuck to only using them in the tests until now. I've tried adding one to the Solr cmd distributor for a bit though - just to see if that helps on Jenkins any. I'd like to narrow in and at least know whether or not this (blackhole hangups) is the problem. For some things, like a request to recover, timeouts may be fine I think.

          Once I am able to log into Jenkins again, I can hopefully narrow down what is happening a lot faster.

          Yonik Seeley added a comment -

          We should be careful of using socket read timeouts in non-test code for operations that could potentially take a long time... commit, optimize, and even query requests (depending on what the request is). By default, solr does not currently time out requests because we don't know what the upper bound is.

          Mark Miller added a comment -

          These tests really need to be done with real jetty instances (at least some of them). I'll try adding some timeouts where we are not currently using them (generally they are used in test code but not always in non-test code).

          Robert Muir added a comment -

          I can't currently get into the hudson machine - used the wrong username the other day and seemed to get ip banned pretty much right away. Looking into getting that undone.

          Yeah, that's probably the best way to move forward. Otherwise you have to wait like an hour just to see if one tweak to a single test worked.

          Which tricks? This could be part of it by the sound of things.

          It depends on what the test is doing, but just a few ideas:

          • any client operations in tests should have a low connect() timeout/so_timeout.
            if you always set this, it will never hang for long periods of time.
          • if you absolutely need to test the case where you don't get a timeout but another exception,
            use an ipv6 test address (e.g. [ff01::114]). because jenkins has no ipv6, it fails fast always. this won't work forever...
          • in a situation where you have A talking to B, and you want to test a condition where B goes down,
            instead of just bringing B down, you can consider mocking up a remote node to test failures:
            bring up a "mock downed server" (e.g. just a ServerSocket on that same port with reuseAddress=true).
            this one can return whatever error you want, or just disconnect, and even assert that A tried to
            connect to it. maybe instead of using "real remote jettys" at all, most tests could even be totally
            implemented this way: it would be faster and simpler than spinning up so many jettys in all the tests.
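
          A rough sketch of that last idea (the class and method names are mine, not from the branch): a ServerSocket that rebinds the downed node's port, counts connection attempts so the test can assert on them, and hangs up on every client.

              import java.net.InetSocketAddress;
              import java.net.ServerSocket;
              import java.net.Socket;
              import java.util.concurrent.atomic.AtomicInteger;

              public class MockDownedServer implements Runnable {
                  private final ServerSocket server;
                  private final AtomicInteger attempts = new AtomicInteger();

                  public MockDownedServer(int port) throws Exception {
                      server = new ServerSocket();
                      server.setReuseAddress(true);       // rebind the port the "downed" node just released
                      server.bind(new InetSocketAddress(port));
                  }

                  public void run() {
                      try {
                          while (!server.isClosed()) {
                              Socket s = server.accept(); // node A connects...
                              attempts.incrementAndGet(); // ...record it for later assertions...
                              s.close();                  // ...and disconnect immediately
                          }
                      } catch (Exception e) {
                          // server socket closed in tearDown - exit quietly
                      }
                  }

                  public int connectionAttempts() { return attempts.get(); }
                  public void shutdown() throws Exception { server.close(); }
              }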
          Mark Miller added a comment -

          Should another issue be opened for the tests?

          I have another issue for the test problem: SOLR-3066

          Do the failures reproduce if you ssh into the hudson machine itself and test from there?

          I can't currently get into the hudson machine - used the wrong username the other day and seemed to get ip banned pretty much right away. Looking into getting that undone.

          Do any tests rely upon not being able to connect to a tcp/udp port

          Sometimes, yes - because jetties are going up and down during these tests, sometimes you wouldn't be able to connect - I wouldn't say we rely on it, but it seems it could happen.

          unless you do some tricks.

          Which tricks? This could be part of it by the sound of things.

          Robert Muir added a comment -

          Should another issue be opened for the tests?

          Do the failures reproduce if you ssh into the hudson machine itself and test from there?
          I've found this useful before when things are hard to reproduce.

          Do any tests rely upon not being able to connect to a tcp/udp port (even localhost)?
          Our hudson machine has an interesting network configuration: it blackholes connections
          to closed ports, so any tests that rely upon this will just hang (for a very long time!)
          unless you do some tricks. This is actually great for testing (imo), because it
          simulates how a real outage can behave: but is likely different from anyone's local machine.
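
          For completeness, the "tricks" amount to bounding both phases of a connection explicitly; a minimal sketch with arbitrary host and timeout values:

              import java.io.IOException;
              import java.net.InetSocketAddress;
              import java.net.Socket;

              public class FailFastConnect {
                  // Both timeouts matter on a blackholing network: the connect() argument
                  // covers the handshake, SO_TIMEOUT covers every subsequent read.
                  static Socket open(String host, int port) throws IOException {
                      Socket s = new Socket();
                      s.connect(new InetSocketAddress(host, port), 1000); // fail within 1s instead of hanging
                      s.setSoTimeout(1000);                               // never block more than 1s in read()
                      return s;
                  }
              }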

          Mark Miller added a comment -

          I knew hudson would get me - that series of tubes runs stuff in some funny land I always have a hard time reproducing. I've ignored a couple tests for the very short term while I try and replicate the first fails on my mac, linux box, or windows VM. So far, it's proving difficult to replicate those fails, but I'll keep banging away over the short term.

          Mark Miller added a comment -

          Okay, I just hit commit. I expect I'll have to do some more test hardening, but I will be pretty responsive to that initially.

          I have not worked out the whole changes entry and how to handle all of these sub-issues - but I will start on that and leave this issue unresolved until I get that done (today or tomorrow, depending on how it goes).

          Yonik Seeley added a comment -

          +1, looks good!

          Mark Miller added a comment -

          Okay, here is the patch - also requires new zookeeper and noggit jars.

          Mark Miller added a comment -

          Okay, tests are passing on my linux box, mac and windows vm. I am working on a patch right now to highlight the changes, then I plan on committing this issue in a day or two. From there, we can iterate on any rough edges on trunk.

          Yonik Seeley added a comment -

          the primary blocker that I see at the moment is that org.apache.solr.search.TestRecovery does not pass on Windows

          Yeah, it's the old transaction logs that are still open after a shutdown (and the test tries to remove those log files).
          I'm in the middle of some deleteByQuery stuff right now, but I should be able to figure out a workaround for the TestRecovery issue this weekend.

          Mark Miller added a comment -

          I'm ready to start looking at merging this branch to trunk - the primary blocker that I see at the moment is that org.apache.solr.search.TestRecovery does not pass on Windows. After that is resolved, I hope to start the merge process!

          Mark Miller added a comment -

          I've tried to make good use of atLeast to minimize the times of some of the larger new solrcloud tests, but they are still not super lightweight (a few of the new ones spin up multiple jetty instances).
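
          (For context: atLeast comes from the Lucene test framework and returns its argument scaled by the test multiplier and nightly settings, so a test stays small locally but grows on CI. A sketch of the pattern, with an illustrative helper name:)

              // Inside a test extending the Lucene/Solr test base class:
              int numDocs = atLeast(50);              // >= 50 locally; larger under nightly/multiplier runs
              for (int i = 0; i < numDocs; i++) {
                  indexDocument(Integer.toString(i)); // hypothetical helper that adds one doc
              }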

          Here is where they currently stand in comparison to current tests without any nightly or multiplier boosts:

          Worst Times:
          test:org.apache.solr.cloud.FullSolrCloudTest time:33.933
          test:org.apache.solr.handler.TestReplicationHandler time:30.002
          test:org.apache.solr.cloud.ChaosMonkeyNothingIsSafeTest time:24.572
          test:org.apache.solr.cloud.ChaosMonkeySafeLeaderTest time:24.271
          test:org.apache.solr.cloud.RecoveryZkTest time:22.875
          test:org.apache.solr.cloud.FullSolrCloudDistribCmdsTest time:22.161
          test:org.apache.solr.cloud.BasicDistributedZkTest time:16.696
          test:org.apache.solr.search.TestRealTimeGet time:16.385
          test:org.apache.solr.TestDistributedGrouping time:15.136
          test:org.apache.solr.TestDistributedSearch time:14.609
          
          Mark Miller added a comment -

          I've made the above change in the branch.

          Mark Miller added a comment -

          Came up in a conversation with a user in #solr IRC - we really want to change the search param distrib to default to true rather than false when in SolrCloud mode.

          Mark Miller added a comment -

          Perhaps within a couple/few weeks, after we stabilize and finish up some hanging work?

          I think we are pretty close to this! There are only a few more nocommits to work down. There is more to add, but I think we will have something stable enough to start iterating on in trunk - hopefully that will trigger even more testing and feedback - it is getting toward the point where the cost of the branch is starting to outweigh the benefits.

          Darren Govoni added a comment -

          Great job Mark. Thanks!

          Mark Miller added a comment -

          Hey Darren - I have rewritten the description a bit, attached a little diagram, and started working on an updated version of the solrcloud wiki page (http://wiki.apache.org/solr/SolrCloud) at http://wiki.apache.org/solr/SolrCloud2.

          If you have any user level questions, it might be more useful to do those on the user mailing list. Anything more related to development, fire away right here.

          Loosely, this issue covers the indexing side of the solrcloud vision - the search side had already been largely done in an earlier phase (though some of that has been improved as well in this phase).

          Darren Govoni added a comment -

          Sounds good! I think I understand what this capability will do, but the ticket description is somewhat vague. Can you give us a more detailed explanation of what this ticket will provide now that it's fully underway? Thanks!!

          Mark Miller added a comment -

          Buffering is back in with retries on failed forwards to leaders.

          Mark Miller added a comment -

          Okay - I've got basic buffering back - I've lost forwarding retries for the moment though - I'll wait to commit to the branch until I've brought that back.

          Mark Miller added a comment -

          As I was working on transforming the old distrib update processor code into something we needed for solrcloud, I dropped its ability to buffer updates. It just made work quicker, and I wasn't really sure how much refactoring would end up happening, so I didn't want to spend too much time on something that only related to performance so early. I'm going to work on adding buffering back to the new SolrCmdDistributor class shortly - I think it means I have to move the 'forward failures' retry logic back into the SolrCmdDistributor - I had it there before, but it was ugly, so I pulled it up a level into the distrib update processor. I think with buffering, though, it needs to go back. (When a forward to the leader fails, we would often like to pause and retry, as it is possible the leader went down and there is now a new one.)

          Mark Miller added a comment -

          I just made it so that a version can be specified on deletes in solrxml, and did the work necessary for distrib deletes to work with versioning. You can do delete by id now.

          Mark Miller added a comment -

          We are starting to get some stable, usable stuff here (even though there is much to do!). We are also starting to get some users who are interested in using this stuff (critical feedback there). So I'd like to propose we try to merge the branch into trunk sooner rather than later, and then iterate from there. Anything too experimental in the future could move back onto a branch again. This will make the merge a bit more digestible as well - rather than building up a crazy amount of differences on the branch. There are also a variety of improvements and fixes in the testing framework and elsewhere that would be nice to get back into trunk. Perhaps within a couple/few weeks, after we stabilize and finish up some hanging work?

          Yonik Seeley added a comment -

          I've made the distrib update processor default. I had to @Ignore BasicZkTest for some reason though.

          Mark Miller added a comment -

          note: distrib delete by id not working at the moment - we need to start propagating versions on SolrCmd objects - right now they are lost on conversion to an update request, and the versioning code is not happy.

          Lance Norskog added a comment -

          Lamport clocks are a time-tested way to sequence actions across a network. In this case, you can use an iterate-until-happy algorithm using the clocks.

          Google "Lamport Clock".
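
          For reference, the two Lamport clock rules are small enough to sketch inline (a generic illustration, not code from this patch):

              import java.util.concurrent.atomic.AtomicLong;

              public class LamportClock {
                  private final AtomicLong time = new AtomicLong();

                  /** Rule 1: tick before every local event or send, and stamp the message. */
                  public long tick() {
                      return time.incrementAndGet();
                  }

                  /** Rule 2: on receive, jump past the sender's timestamp, then tick. */
                  public long receive(long senderTime) {
                      return time.updateAndGet(local -> Math.max(local, senderTime) + 1);
                  }
              }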

          Darren Govoni added a comment -

          My use case for this is to permit effective indexing on local nodes across a cluster by local processes - in a "shared nothing" architecture as much as is possible. Let's say I have 100 machines each with their own Solr server. Each machine will process data and commit index writes locally. This avoids global locking and also 100 x n number of threads hitting a central server/port (which won't scale). It should be possible then, to reconcile the 100 x n indexes either at query time (federated search) or merge time (distributed merge). In my scenario, I'm not as concerned about replication or redundancy from the cloud. Cheers.

          Mark Miller added a comment -

          P.S. This lock is simply for auto layout of the cluster - if you are going to manually specify the layout, it wouldn't be used. If we ended up with an overseer, this lock could happen on it instead. Basically, if all the nodes fire up at the same time, you still want them to be sanely assigned as a shard / replica, which requires knowing the assignments that have already happened.

          Mark Miller added a comment -

          Generally speaking, it seems like we should avoid locks as much as possible. Should be more scalable...

          Yeah, I had the same initial reaction - a collection-wide lock? Who likes locks? In reality, I'm not too worried though - it's a simple, very short lock for changing the cluster layout for a collection - this is not a normal thing that will happen - normally the cluster layout will be stable - this is mostly just as the cluster is coming up. So for simplicity, and in the spirit of getting something working, it's easy to just start with a simple lock here - if it's really a problem (I doubt it myself), it's easy enough to do this differently later.

          Mark Miller added a comment -

          Initially, a request will be fully synchronous and will not return success to the client until the request is sent to each replica. So if a leader goes down before all replicas receive and ACK the request, the client will not get an ACK. A new leader will be elected. When the downed, previous leader comes back, it will come up in recovery mode. I expect recovery to be a difficult part, and we have not fully worked it out yet. To recover, the node will have to talk to the leader and figure out what it has that it should not, what it doesn't have, etc. Then the recovering node either receives replays or replaces the entire index. Lots of details to work out here.

          You have an interesting problem in that some replica leader candidates may have an update while others don't, as the leader may have died in the middle of relaying requests. We might prefer a new leader with the greatest versioned doc? Most client retries in this case will be fine (globally unique ids are required, so no worry about dupes). Then replicas talk to the leader and sync up. Or when a new leader is elected, replicas just talk amongst each other and sync up, or…

          If the leader fails right before sending an ACK, the client will likely repeat the request. In the case of doc adds/updates with the same id, the retry will just replace the previous success, or the client will be able to use optimistic locking to figure out that either its update or someone else's actually went through already. The client would already know that perhaps its update went through, because the connection would have timed out rather than receive a failure.

          Eventually, we might consider a mode where the request is ACK'd before it's on all replicas, in which case you might accept a higher risk of data loss.

          indexes diverge because some replicas commit a change while others do not

          It's an area we have not fully worked out (though Yonik has likely thought about a lot of this more than I have yet) - initially, though, Yonik's point was that you can usually expect success on all nodes unless the issue is something that would require the node to come down and then come back in recovery mode, I think. We certainly want to be resilient here eventually, though. As we work through recovery scenarios, I think this will become more clear.

          Long story short, we have been discussing and thinking about these various scenarios, but largely we are also taking things an issue at a time.
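
          The optimistic locking idea above boils down to a compare-before-apply on a stored version; a deliberately simplified sketch (the branch's real versioning code is more involved):

              import java.util.HashMap;
              import java.util.Map;

              public class VersionedStore {
                  private final Map<String, Long> versions = new HashMap<>();

                  /**
                   * Applies an update only if the client's expected version matches the
                   * stored one. A retried request whose first attempt already succeeded
                   * fails this check, telling the client its update went through.
                   */
                  public synchronized boolean update(String id, long expectedVersion, long newVersion) {
                      long current = versions.getOrDefault(id, 0L);
                      if (current != expectedVersion) {
                          return false; // a newer update (possibly our own retry) already won
                      }
                      versions.put(id, newVersion);
                      return true;
                  }
              }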

          Ted Dunning added a comment -

          Mark,

          How do you handle failure scenarios?

          The failures I am curious about are:

          • the leader fails, but a transaction is still sent to it because the client didn't get the memo in time
          • the leader fails but has already written a transaction locally without having a chance to forward it to the followers
          • the leader fails after writing locally and to the replicas but before sending an ACK
          • a replica is partitioned from the cluster, a transaction is received and committed by all live replicas and then the failed index returns from the land of the living dead.

          The bad behaviors that need to be avoided include

          • document acked but not inserted
          • document not acked, inserted again and two copies wind up in the index
          • indexes diverge because some replicas commit a change while others do not

          Two-phase commit is not generally a viable solution for this in a cluster where failures can occur, because it requires locks to be taken. Once these locks are taken, the cluster cannot proceed until the locks are cleared, and this cannot be done reliably in the presence of failures.

          Zookeeper avoids this to a large degree by making updates idempotent before they are inserted into the update queue. This means that if the updates are done more than once, most importantly during error recovery, no error actually occurs. This is what makes ZK able to take snapshots without stopping the world. It does not entirely resolve the case of transactions that are committed but not acked.

          Ted Dunning added a comment -

          I think locks should be completely out of bounds if only because they are hell to deal with in the presence of failures. This is a major reason that ZK is not a lock manager but supports atomic updates at a fundamental level.

          State of a node doesn't need a lock. The node should just update its own state, and that state should be ephemeral so that if the node disappears, the state reflects that. Anybody who cares in a real-time kind of way about the state of that node should put a watch on that node's state.

          Creating a new collection is relatively trivial without a lock as well. One of the simplest ways is to simply put a specification of the new collection into a collections directory in ZK. The cluster overseer sees the addition and it parcels out shard assignments to nodes. The nodes see the assignments change and they take actions to conform to the specification, advertising their progress in their state files. All that is needed here is atomic update which ZK does just fine.

          If it helps, there is a simplified form of this in Chapter 16 of Mahout in Action. The source code for this example is available at https://github.com/tdunning/Chapter-16. This example only has nodes, but the basic idea of parcelling out assignments is the same.

          A summary of what I would suggest is this:

          • three directories:
                /collections
                /node-assignments
                /node-states
            

            The /collections directory is updated by anybody wishing to advertise or delete a collection. The node-assignments directory is updated only by the overseer. The node-states directory is updated by each node.

          • one leader election file
                /cluster-leader
            

            All of the potential overseers try to create this file (ephemerally) and insert their IP and port. The one that succeeds is the overseer, the others watch for the file to disappear. On disconnect from ZK, the overseer stops acting as overseer, but does not tear down local state. On reconnect, the overseer continues acting as overseer. On session expiration, the overseer tears down local state and attempts to regain the leadership position.

          The cluster overseer never needs to grab locks since atomic read-modify-write to node state is all that is required.

          Again for emphasis,

          1) cluster-wide locks are a bug in a scalable clustered system. Leader election is an allowable special case.

          2) locks are not required for clustered SOLR.

          3) a lock-free design is incredibly simple to implement.
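
          The leader-election special case maps directly onto ZooKeeper's ephemeral create; a condensed sketch using the standard client API (paths and error handling simplified):

              import org.apache.zookeeper.*;

              public class OverseerElection implements Watcher {
                  private final ZooKeeper zk;
                  private final byte[] myAddress; // e.g. "10.0.0.5:8983" as bytes

                  public OverseerElection(ZooKeeper zk, byte[] myAddress) {
                      this.zk = zk;
                      this.myAddress = myAddress;
                  }

                  /** Try to become overseer; on failure, watch the current overseer's node. */
                  public boolean tryLead() throws KeeperException, InterruptedException {
                      try {
                          // Ephemeral: the znode vanishes automatically if our session expires.
                          zk.create("/cluster-leader", myAddress,
                                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                          return true;  // we are the overseer
                      } catch (KeeperException.NodeExistsException e) {
                          zk.exists("/cluster-leader", this); // watch for the leader to go away
                          return false;
                      }
                  }

                  public void process(WatchedEvent event) {
                      if (event.getType() == Event.EventType.NodeDeleted) {
                          try {
                              tryLead(); // previous overseer is gone; contend again
                          } catch (Exception ignored) {}
                      }
                  }
              }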

          Yonik Seeley added a comment -

          As far as locking vs leader, I think maybe both can make sense.
          Some things are logically more node-specific and a lock can make more sense there (so that a node can modify its own state).
          Also, something like a command to create a new collection might be easier with a cluster lock. The node that received the command can just do it, rather than introducing logic to forward the command to the cluster leader (or put the request in a ZK queue or something, to be pulled by someone, which still needs coordination to make sure only one node is trying to do it).

          On the other hand, cluster overseer code that might want to watch the cluster and change the configuration... a single cluster leader makes sense there (and they may end up also grabbing some sort of lock to avoid conflicts with what other nodes may do).

          Grant Ingersoll added a comment -

          I've been reading through the design docs on the wiki a bit. Not sure if this issue is the best place to discuss, but I'm curious/concerned about the locking stuff. Seems like it might be simpler/faster/more scalable to go with what is described in the "Alternate to Cluster Lock" section. Generally speaking, it seems like we should avoid locks as much as possible. Should be more scalable and it should be less code, right?

          Mark Miller added a comment -

          Actually, the commit will be a bit delayed - the new test likes to hang when running in parallel with others under ant test - will have to dig...

          Mark Miller added a comment -

          Okay, I'm going to commit some really early stuff to the branch here...ugly code and lots of system.out's still there...but we can start tying in versioning and what not...

          Commit adds the distrib update processor and makes it cloud aware.

          If you add a doc to a replica, it forwards it to the leader. If a doc comes to the leader, it versions it (super mock/fake at the moment - param is set to docversion=yes) and forwards it to each replica in the shard (including itself).

          Also a couple basic tests added around this, and other little fixes that were found/needed along the way...

          The current main test for this fires up a control and 3 shards, each with 1 replica (6 cores total). Indexing is then round robin'd to each shard (randomly adding either to the leader or the replica). Then the standard distrib search tests are run (with load balancing across replicas) and results compared with the control.

          Early, early stuff - but it's a start. None of the hashing stuff we will be doing is involved yet.

          Erick Erickson added a comment -

          Fat fingers, I didn't intend to assign this to myself in the first place!

          Jan Høydahl added a comment -

          I'm not sure if DirectUpdateHandler2 is the right location either. My point is that the user should not need to manually make sure that the UpdateProcessor is present in all his UpdateChains for distributed indexing to work. See new issue SOLR-2370 for a suggestion on how to tackle this.

          Alex Cowell added a comment -

          Since this functionality is core to Solr and should always be present, it would be natural to either build it into the DirectUpdateHandler2 or to add this processor to the set of default UpdateProcessors that are executed if no update.processor parameter is specified.

          What advantage would we gain from moving this functionality into DirectUpdateHandler2? From what I understand, the UpdateHandler deals directly with the index whereas the DistributedUpdateRequestProcessor merely takes requests deemed to be distributed by the request handler and distributes them to a list of shards based on a distribution policy.
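
          To make that division of labor concrete: an update processor wraps the next processor in the chain and may intercept each command before the UpdateHandler ever sees it, which is why distribution fits naturally there. A minimal sketch (the class name and the forwarding step are placeholders, not the patch's actual code):

              import java.io.IOException;
              import org.apache.solr.update.AddUpdateCommand;
              import org.apache.solr.update.processor.UpdateRequestProcessor;

              public class SketchDistributingProcessor extends UpdateRequestProcessor {

                  public SketchDistributingProcessor(UpdateRequestProcessor next) {
                      super(next);
                  }

                  @Override
                  public void processAdd(AddUpdateCommand cmd) throws IOException {
                      // Placeholder: decide which shard owns this document and forward the
                      // command there (e.g. over HTTP), per the distribution policy.

                      // Then let the rest of the chain - and ultimately the UpdateHandler,
                      // which talks directly to the local index - run as usual.
                      super.processAdd(cmd);
                  }
              }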

          Jan Høydahl added a comment -

          See SOLR-2293 for some thoughts.

          Since this functionality is core to Solr and should always be present, it would be natural to either build it into the DirectUpdateHandler2 or to add this processor to the set of default UpdateProcessors that are executed if no update.processor parameter is specified.

          Jan Høydahl added a comment -

          There was an existing issue on this already; marking the other issue as a duplicate since there is already a patch over here.

          Alex Cowell added a comment -

          Added a patch which handles distributed add and commit update requests.

          Please see the 'Distributed Indexing' thread on the dev mailing list for more info.

          Yonik Seeley added a comment -

          See SOLR-2355 for a very simple implementation.


            People

             • Assignee: Unassigned
             • Reporter: William Mayor
             • Votes: 10
             • Watchers: 28

              Dates

               • Created:
               • Updated:
               • Resolved:
