Solr / SOLR-1698

load balanced distributed search

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels: None

      Description

      Provide syntax and implementation of load-balancing across shard replicas.

      Attachments

      1. SOLR-1698.patch
        38 kB
        Yonik Seeley
      2. SOLR-1698.patch
        38 kB
        Yonik Seeley
      3. SOLR-1698.patch
        28 kB
        Yonik Seeley
      4. SOLR-1698.patch
        20 kB
        Yonik Seeley
      5. SOLR-1698.patch
        10 kB
        Yonik Seeley

          Activity

          Shalin Shekhar Mangar added a comment -

          This has been incorporated in SolrCloud.

          Jan Høydahl added a comment -

          I think this issue could be closed?

          patrick o'leary added a comment -

          How does this work with search domains in resolv.conf?

          Yonik Seeley added a comment (edited) -

          Note: it appears that hosts that don't exist (DNS failure) are load balanced fine on Windows, but cause the tests to fail (specifically TestDistributedSearch) on at least Ubuntu. Looking into it...

          Update: it looks like the issue was down to the DNS provider returning a "302 Found" and an actual response for the bad host. If I add a ".com" to the bad host, everything starts working again. That's what I've done for now.

          Yonik Seeley added a comment -

          Committed no-retries on the solr_cloud branch. It has the nice side effect of speeding up some of the tests too.

          Yonik Seeley added a comment -

          This is now part of the solr_cloud branch.
          Seems like we should change the default number of retries from 3 to 1 in our HttpClient... otherwise failover can take a little while.
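
          For reference, here is a rough sketch of how the retry count could be lowered on the Commons HttpClient 3.x client that SolrJ uses; where exactly Solr wires this in is not shown, and the setup below is an assumption for illustration, not the committed change:

            import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
            import org.apache.commons.httpclient.HttpClient;
            import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
            import org.apache.commons.httpclient.params.HttpMethodParams;

            public class RetryConfigSketch {
              // Client that would be handed to CommonsHttpSolrServer / LBHttpSolrServer.
              public static HttpClient buildClient() {
                HttpClient httpClient = new HttpClient(new MultiThreadedHttpConnectionManager());
                // Retry a failed request at most once instead of HttpClient's default of 3,
                // so a dead server is detected (and failover kicks in) sooner.
                httpClient.getParams().setParameter(
                    HttpMethodParams.RETRY_HANDLER,
                    new DefaultHttpMethodRetryHandler(1, false));
                return httpClient;
              }
            }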

          Uri Boness added a comment -

          Yep, that works.

          Yonik Seeley added a comment -

          OK, here's a new patch - update trunk first to get the removed $Id

          Yonik Seeley added a comment -

          Hmmm, the rejection had $Id - that might be the cause. I'll see if I can get rid of it first and generate a new patch.

          Uri Boness added a comment -

          I think the patch doesn't work. I just checked out the trunk, and applying the patch fails with a conflict in LBHttpSolrServer.java.

          Yonik Seeley added a comment -

          Latest patch - everything is back to working. This fixes the load balancing algorithm to try dead servers if no live servers are available, adds tests for this case, and includes a bunch of other minor cleanups.

          I'll merge on the cloud branch very soon and give a little time for others to review before committing to trunk.

          Yonik Seeley added a comment -

          Not sure what's going on, but it seems that if I have more than one load balancer instance, only one of them does zombie checks. Really strange - I've been banging my head against the wall for a while now.

          Yonik Seeley added a comment -

          Another problem with the current load balancing algorithm is that if someone bounces both servers (or there is a temporary loss of network connectivity or whatever), they won't be marked up again until a successful ping (by default 1 minute later).

          Instead, it seems like we should first try all live servers, and then try all dead servers to see if they are still truly dead before failing a request.
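
          For illustration, a rough, self-contained sketch of that order of attempts (assumed names and 4.x SolrJ client classes; it creates throwaway clients per try and ignores the real alive/zombie bookkeeping):

            import java.util.List;
            import org.apache.solr.client.solrj.SolrQuery;
            import org.apache.solr.client.solrj.SolrServerException;
            import org.apache.solr.client.solrj.impl.HttpSolrServer;
            import org.apache.solr.client.solrj.response.QueryResponse;

            public class LiveThenDeadSketch {
              // Try every server currently believed alive; only if all of those fail,
              // probe the servers marked dead before giving up on the request.
              public static QueryResponse query(SolrQuery q, List<String> aliveUrls,
                                                List<String> deadUrls) throws SolrServerException {
                Exception lastError = null;
                for (String url : aliveUrls) {
                  try {
                    return new HttpSolrServer(url).query(q);
                  } catch (Exception e) {
                    lastError = e; // would be moved to the zombie list here; keep trying
                  }
                }
                for (String url : deadUrls) {
                  try {
                    return new HttpSolrServer(url).query(q); // maybe it already came back
                  } catch (Exception e) {
                    lastError = e; // still dead, try the next one
                  }
                }
                throw new SolrServerException("no servers available to handle this request", lastError);
              }
            }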

          Yonik Seeley added a comment -

          New patch that hooks load balancing into distributed search.
          Equivalent servers are separated by "|".
          Example: shards=localhost:8983,localhost:8985|localhost:9985
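
          For illustration, a minimal SolrJ sketch of issuing such a request (assumed 4.x client classes and placeholder hosts; shard addresses normally include the context path, e.g. localhost:8983/solr):

            import org.apache.solr.client.solrj.SolrQuery;
            import org.apache.solr.client.solrj.impl.HttpSolrServer;
            import org.apache.solr.client.solrj.response.QueryResponse;

            public class ShardsParamExample {
              public static void main(String[] args) throws Exception {
                // Any node can coordinate the distributed request.
                HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

                SolrQuery q = new SolrQuery("*:*");
                // Two shards; the second lists two equivalent replicas separated by "|",
                // so the coordinator can load balance (and fail over) between them.
                q.set("shards", "localhost:8983/solr,localhost:8985/solr|localhost:9985/solr");

                QueryResponse rsp = server.query(q);
                System.out.println("hits: " + rsp.getResults().getNumFound());
              }
            }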

          Yonik Seeley added a comment -

          Attaching new patch, still limited to LBHttpSolrServer at this point.

          • includes tests
          • adds a new expert-level API:
            public Rsp request(Req req) throws SolrServerException, IOException
            I chose objects (Rsp and Req) since I imagine we will need to continue to add new parameters and controls to both the request and the response (especially the request... things like timeout, max number of servers to query, etc). The Rsp also contains info about which server returned the response and will allow us to stick with the same server for all phases of a distributed request. (A usage sketch follows this list.)
          • adds the concept of "standard" servers (those provided by the constructor or addServer)... a server on the zombie list that isn't a standard server won't be added to the alive list if it wakes up, and will not be pinged forever.
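
          For illustration, a hedged sketch of how the new API might be called; the Req constructor arguments and the Rsp accessor shown here are assumptions, not necessarily the exact signatures in the patch:

            import java.util.Arrays;
            import org.apache.solr.client.solrj.SolrQuery;
            import org.apache.solr.client.solrj.impl.LBHttpSolrServer;
            import org.apache.solr.client.solrj.request.QueryRequest;

            public class LbExpertApiSketch {
              public static void main(String[] args) throws Exception {
                // Shared load-balancing client over the "standard" servers.
                LBHttpSolrServer lb = new LBHttpSolrServer(
                    "http://localhost:8983/solr", "http://localhost:8985/solr");

                // Per-request, ordered server list (assumed Req(SolrRequest, List<String>) shape).
                LBHttpSolrServer.Req req = new LBHttpSolrServer.Req(
                    new QueryRequest(new SolrQuery("*:*")),
                    Arrays.asList("http://localhost:8985/solr", "http://localhost:9985/solr"));

                LBHttpSolrServer.Rsp rsp = lb.request(req);
                // The Rsp reports which server answered, so later phases of the same
                // distributed request can be routed back to that replica.
                System.out.println("served by: " + rsp.getServer());
              }
            }
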
          Yonik Seeley added a comment -

          Draft patch to LBHttpSolrServer that lays the groundwork for being able to more easily use it in distributed search. This also removes much of the locking.

          Next step will be to add a method that allows one to query an arbitrary server list in the given order. This will involve the zombie list, but not the "alive" list. We also need to avoid never-ending growth of the zombie list (and never-ending pinging of servers that are gone) by setting a ping limit.
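
          For illustration, one way the ping limit could look; this is only a sketch, and the class/field names and the limit of 5 are assumptions rather than details from the patch:

            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;
            import org.apache.solr.client.solrj.SolrQuery;
            import org.apache.solr.client.solrj.impl.HttpSolrServer;

            public class ZombiePingSketch {
              static final int PING_LIMIT = 5; // assumed cap on liveness probes per zombie

              static class Zombie {
                final HttpSolrServer server;
                int failedPings = 0; // consecutive failed liveness checks
                Zombie(String url) { this.server = new HttpSolrServer(url); }
              }

              final Map<String, Zombie> zombies = new ConcurrentHashMap<String, Zombie>();

              // Periodic check: revive a zombie that answers, drop one that keeps failing
              // so the zombie list (and the pinging) cannot grow without bound.
              void checkZombie(Zombie z) {
                try {
                  z.server.query(new SolrQuery("*:*")); // liveness probe
                  zombies.remove(z.server.getBaseURL()); // it is back; re-add to "alive" elsewhere
                } catch (Exception e) {
                  if (++z.failedPings >= PING_LIMIT) {
                    zombies.remove(z.server.getBaseURL()); // stop pinging a server that is gone
                  }
                }
              }
            }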

          Noble Paul added a comment -

          LBHttpSolrServer can have the concept of a sticky session, and the session object can be used for all shard requests made in a single Solr request.

          Yonik Seeley added a comment -

          Looking into LBHttpSolrServer more, it looks like we have some serious concurrency issues. When a request does fail, a global lock is acquired to move from alive to zombie - but this same lock is used while making requests to check if zombies have come back (i.e. actual requests to zombies are being made with the lock held!).

          For distributed search use (SearchHandler) I'm thinking of going with option #2 from my previous message (have a single static LBHttpSolrServer instance that's shared for all requests, with an extra method that allows passing of a list of addresses on a per-request basis.). I'll address the concurrency issues at the same time.

          Yonik Seeley added a comment (edited) -

          Another big question is: can we use LBHttpSolrServer for this, or are the needs too different?

          Some of the issues:

          • need control over order (so the same server will be used in a single request)
          • if we have a big cluster (100 shards), we don't want every node or every core to have 100 background threads checking liveness
          • one request may want to hit addresses [A,B] while another may want [A,B,C] - a single LBHttpSolrServer can't currently do both at once, and separate instances wouldn't share liveness info.

          One way: have many LBHttpSolrServer instances (one per shard group) but have them share certain things like the liveness of a shard and the background cleaning threads.

          Another way: have a single static LBHttpSolrServer instance that's shared for all requests, with an extra method that allows passing of a list of addresses on a per-request basis.

          Noble Paul added a comment -

          Is this related to SOLR-1431? I thought we could have custom ShardComponents for these things.

          Yonik Seeley added a comment -

          Picking which shard replica to request can be random (round-robin or whatever, and customizable in the future), but a single distributed request should use the same replica for all phases of the request when possible.

          Yonik Seeley added a comment -

          This is related to the solr cloud branch, but is perhaps separable enough that I thought I'd try integrating into trunk.
          We could add pipe delimiters between equivalent shards - useful for testing, troubleshooting, and ad hoc requests, even when shard info will be retrieved from zookeeper.

          Example shards param:
          shards=localhost:8983/solr|localhost:8985/solr,localhost:7574/solr|localhost:7576/solr


            People

            • Assignee: Unassigned
            • Reporter: Yonik Seeley
            • Votes: 0
            • Watchers: 4
