SOLR-1277: Implement a Solr specific naming service (using ZooKeeper)

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.4
    • Fix Version/s: 3.2
    • Component/s: None
    • Labels:
      None

      Description

      The goal is to give Solr server clusters self-healing attributes:
      if a server fails, indexing and searching don't stop, and all of
      the partitions remain searchable. For configuration, the goal is
      the ability to centrally deploy a new configuration without
      servers going offline.

      We can start with basic failover and go from there.

      Features:

      • Automatic failover (i.e. when a server fails, clients stop
        trying to index to or search it)
      • Centralized configuration management (i.e. new solrconfig.xml
        or schema.xml propagates to a live Solr cluster)
      • Optionally allow shards of a partition to be moved to another
        server (i.e. if a server gets hot, move the hot segments out to
        cooler servers). Ideally we'd have a way to detect hot segments
        and move them seamlessly. With NRT this becomes somewhat more
        difficult but not impossible?

      Attachments

      1. zookeeper-3.2.1.jar
        892 kB
        Mark Miller
      2. SOLR-1277.patch
        35 kB
        Grant Ingersoll
      3. SOLR-1277.patch
        67 kB
        Mark Miller
      4. SOLR-1277.patch
        73 kB
        Mark Miller
      5. SOLR-1277.patch
        44 kB
        Mark Miller
      6. log4j-1.2.15.jar
        383 kB
        Grant Ingersoll

        Issue Links

          Activity

          Gavin made changes -
          Link This issue depends upon SOLR-1585 [ SOLR-1585 ]
          Gavin made changes -
          Link This issue depends on SOLR-1585 [ SOLR-1585 ]
          Antony Stubbs added a comment -

           What happened with regard to multicore support and ZooKeeper? There are

           http://wiki.apache.org/solr/SolrCloud
           http://wiki.apache.org/solr/ZooKeeperIntegration

           but it's as clear as mud. Does either system support multicore? Or do I need to set up 4 servers, 2 for each core? (I have a single server with 2 cores - I want redundancy.)

           Does SolrCloud supersede ZooKeeperIntegration?

          Robert Muir made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Robert Muir added a comment -

          Bulk close for 3.2

          Hoss Man made changes -
          Fix Version/s 3.2 [ 12316172 ]
          Fix Version/s Next [ 12315093 ]
          Grant Ingersoll made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Won't Fix [ 2 ]
          Grant Ingersoll added a comment -

           I'm going to mark this as "won't fix" in favor of several smaller issues that are taking care of incorporating ZooKeeper into Solr. See http://wiki.apache.org/solr/SolrCloud

          Hoss Man made changes -
          Fix Version/s Next [ 12315093 ]
          Fix Version/s 1.5 [ 12313566 ]
          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

           Selection criteria were "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Yonik Seeley made changes -
          Link This issue is related to SOLR-1698 [ SOLR-1698 ]
          Patrick Hunt added a comment -

           Yonik, others, you might find this interesting:
           http://github.com/phunt/zookeeper_dashboard
           It's Apache licensed and based on Python/Django. I had thoughts of having "plugins" for things like
           HBase, Solr, etc., based on their usage patterns (also for standard recipes that might benefit from something similar).

          Yonik Seeley added a comment -

           FYI, I've added a basic ZooKeeper browser to the Solr admin. My HTML efforts give real web developers migraines, but hey, it works.
           I haven't added an admin link yet... just hit
           http://localhost:8983/solr/admin/zookeeper.jsp

          Grant Ingersoll added a comment -

          Please make all timeout and failure-handling policies configurable.

           +1. Sensible defaults with the ability to override are the key.

          Lance Norskog added a comment -

          Back when I had a 500m document index, I did data-mining queries that took 20-30 minutes. They finished. I got my results.

          Please make all timeout and failure-handling policies configurable.

          Mark Miller added a comment -

          We should iterate on the wiki, but I think this is what we should be working from: http://wiki.apache.org/solr/SolrCloud

          Jason Rutherglen added a comment -

          Zookeeper gives us the layout of the cluster. It doesn't
          seem like we need (yet) fast failure detection from zookeeper -
          other nodes can do this synchronously themselves (and would need
          to anyway) on things like connection failures. App-level
          timeouts should not mark the node as failed since we don't know
          how long the request was supposed to take.

           Google Chubby, when used in conjunction with search, sets a high
           timeout of 60 seconds, I believe?

           Fast failover is difficult, so it'll be best to enable fast
           re-requesting to adjacent slave servers on request failure.

           Mahadev has some good advice about how we can separate the logic
           into different znodes. Going further, I think we'll want to allow
           cores to register themselves, then listen to a separate
           directory as to what state each should be in. We'll need to
           ensure the architecture allows for defining multiple tiers (like a pyramid).

          At http://wiki.apache.org/solr/ZooKeeperIntegration is a node a
          core or a server/corecontainer?

           To move ahead we'll really need to define and settle on the
           directory and file structure. I believe the ability to group
           cores, so that one may issue a search against a group name
           instead of individual shard names, will be useful. The
           ability to move cores to different nodes will be necessary, as
           is the ability to replicate cores (i.e. have multiple copies
           available on different servers).

           Today I deploy lots of cores from HDFS across quite a few
           servers containing 1.6 billion documents representing at least
           2.4 TB of data. I mention this because a lot can potentially go
           wrong in this type of setup (i.e. servers going down, corrupted
           data, intermittent network, etc.). I generate a file that contains
           all the information as to which core should go to which Solr
           server using size-based balancing. Ideally I'd be able to
           generate a new file, perhaps for load balancing the cores across
           new Solr servers or to define that hot cores should be
           replicated, and the Solr cluster would move the cores to the
           defined servers automatically. This doesn't include the separate
           set of servers that handles incremental updates (i.e.
           master -> slave).

           There's a bit of trepidation in moving forward on this because
           we don't want to engineer ourselves into a hole; however, if we
           need to change the structure of the znodes in the future, we'll
           need a healthy versioning plan such that one may upgrade a
           cluster while maintaining backwards compatibility on a live
           system. Let's think of a basic plan for this.

           In conclusion, let's iterate on the directory structure via the
           wiki or this issue?

           A search node can have very large caches tied to readers
           that all drop at once on commit, and can require a much larger
           heap to accommodate these caches. I think that's a more common
           scenario that creates these longer pauses.

           The large cache issue should be fixable with the various NRT
           changes in SOLR-1606. They're collectively not much different from
           the per-segment search and sort changes made in Lucene 2.9.

          Grant Ingersoll added a comment -

           Yeah, I wonder myself if that's true anymore. Indexing is not the object-spewing process it used to be with all the reuse that goes on now - and it can operate in much less RAM with more constant garbage created to collect. When I was profiling a while back looking for a possible indexing leak, I didn't see much for pausing. A search node can have very large caches tied to readers that all drop at once on commit, and can require a much larger heap to accommodate these caches. I think that's a more common scenario that creates these longer pauses.

           Yep, this is consistent with what I've seen in a good chunk of either really high-end systems (people really pushing the limits of not only Solr, but their hardware, etc.) or really poorly configured systems.

          Mark Miller added a comment -

           Yeah, I wonder myself if that's true anymore. Indexing is not the object-spewing process it used to be with all the reuse that goes on now - and it can operate in much less RAM with more constant garbage created to collect. When I was profiling a while back looking for a possible indexing leak, I didn't see much for pausing. A search node can have very large caches tied to readers that all drop at once on commit, and can require a much larger heap to accommodate these caches. I think that's a more common scenario that creates these longer pauses.

          Grant Ingersoll added a comment -

          indexing nodes are generally more likely to have long GC pauses than searcher nodes

           Not so sure here; most full collections that I've seen in Solr seem to happen around commit, whether on an indexer or a searcher. In fact, if the indexer is configured right, it doesn't even need to bring a new searcher up, which is usually where the GCs come from, due to there being two searchers in memory. This is especially the case when people haven't properly tuned their system, which is going to be harder to do on a dist. system, I would think.

          Mahadev konar added a comment -

           Hi all,
           this is Mahadev from the ZooKeeper team. One of our users does similar things to what you guys have been talking about in the above comments. I am not sure how close I am to your scenario but I'll give it a shot. Feel free to ignore my comments if they sound stupid. One of the things that they do is - let's say you have a machine A that is running a process P and is part of your cluster. The way they track the status of this machine is by having 2 znodes (ZNODE1, ZNODE2) in ZooKeeper. ZNODE1 is an ephemeral node (created by P) and the other one (ZNODE2) is a normal node which contains process P specific data which is updated from time to time by process P (like last time of update, status of process P - good/bad/ok). If an application/user wants to access P on machine A, they look at the ephemeral node and the data in ZNODE2 to see if process P has any problems (not related to ZooKeeper), and then the application can decide if process P actually needs to be marked dead or not. Say the ephemeral node ZNODE1 is alive but ZNODE2 shows that process P is in a really bad state; then the application will go ahead and mark process P as dead. Hope this information is of some help!

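           As an illustration of the two-znode pattern described above, here is a minimal reader-side sketch against the plain ZooKeeper Java client. The paths (/solr/nodes/<node>/alive for ZNODE1, /solr/nodes/<node>/status for ZNODE2) and the status payload are made up for the example, not taken from any patch on this issue.

           import org.apache.zookeeper.ZooKeeper;
           import org.apache.zookeeper.data.Stat;

           // Reader-side check combining the ephemeral "alive" znode (session liveness)
           // with the persistent "status" znode (application-level health).
           // Paths and the "bad" status value are hypothetical.
           public class TwoZnodeHealthCheck {
             private final ZooKeeper zk;

             public TwoZnodeHealthCheck(ZooKeeper zk) { this.zk = zk; }

             /** Returns true if process P on the given node should be treated as dead. */
             public boolean isDead(String node) throws Exception {
               Stat alive = zk.exists("/solr/nodes/" + node + "/alive", false);   // ZNODE1
               if (alive == null) {
                 return true;                  // ephemeral node gone: session expired
               }
               byte[] data = zk.getData("/solr/nodes/" + node + "/status", false, null); // ZNODE2
               String status = (data == null) ? "" : new String(data, "UTF-8");   // e.g. "good", "ok", "bad"
               return "bad".equals(status);    // session alive, but self-reported as unhealthy
             }
           }

           The ephemeral znode only says the ZooKeeper session is alive; the second znode carries the application-level health that ZooKeeper itself cannot see, which is what lets the caller decide whether to mark P dead.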
          Yonik Seeley added a comment -

          How are we addressing a failed connection to a slave server, and instead of failing the request, re-making the request to an adjacent slave?

          Yes, I didn't spell it out, but that's the HA part of why you have multiple copies of a shard (in addition to increasing capacity).

           The way things work now, if someone searched during the GC, they'd get all the results back, the search would just take longer. They'd see the hourglass spinning, know the results were slow for this search, but still coming. I was/am not sure if we wanted to replicate that.

           I think we always need to support that. If/when a Solr request should time out should be decided on a per-request basis, and the default should probably be to not time out at all (or at least have a very high timeout). This really doesn't have anything to do with ZooKeeper.

          Zookeeper gives us the layout of the cluster. It doesn't seem like we need (yet) fast failure detection from zookeeper - other nodes can do this synchronously themselves (and would need to anyway) on things like connection failures. App-level timeouts should not mark the node as failed since we don't know how long the request was supposed to take.

          Mark Miller added a comment -


          I think the timeouts are going to have to be different depending on the role of the particular node. In a really distributed setup, indexing nodes are generally more likely to have long GC pauses than searcher nodes, and a lengthy GC pause on an indexer is usually not a problem. However, if a searcher node goes out on a long GC pause then you need to find out fast and bypass the box before too many queries back up and need to be retried (though even this depends on throughput, response time, and number of other available nodes.)

           Currently, I've got a default timeout, with the ability to override it at any node in solr.xml. Do you think that's enough?

          I can imagine putting the timeout for different roles in ZooKeeper, and then a node gets its timeout there based on its role - but then it would have to make multiple connections - one with a default timeout to get its timeout, and then another with the correct timeout.

          Mark Miller added a comment -

          maybe this is already in the spec

          Nothing is completely nailed down in the spec - Yonik has done a bunch of work on the SolrCloud page, but a lot of that is: we could do this, or we could do that, or we might do this. We haven't really nailed much down firmly. Still pretty high level at the moment.

          How are we addressing a failed connection to a slave server, and instead of failing the request, re-making the request to an adjacent slave?

          We haven't really gotten there. But we want to cover that. What do you propose?

          The more we get these discussions going, the faster things will start getting nailed down ...

          A failure is a failure and whether it's the GC or something else, it's really the same thing.

           It's kind of an arbitrary distinction. You're saying we would say a GC pause of 4 seconds (under the ZK client timeout) is not a failure, and a GC pause of 6 seconds (over the ZK client timeout) is a failure. I'm not claiming any distinction is better than another though - just trying to work out the directions we want to go so I can start paddling.

           I can code till the cows come home with no input, but you might not like the results.

          Brian Pinkerton added a comment -

          I think the timeouts are going to have to be different depending on the role of the particular node. In a really distributed setup, indexing nodes are generally more likely to have long GC pauses than searcher nodes, and a lengthy GC pause on an indexer is usually not a problem. However, if a searcher node goes out on a long GC pause then you need to find out fast and bypass the box before too many queries back up and need to be retried (though even this depends on throughput, response time, and number of other available nodes.)

          Jason Rutherglen added a comment -

          as two types of failures, possibly

          A failure is a failure and whether it's the GC or something
          else, it's really the same thing. Sounds like we're defining the
          expectation of the client handling of a failure?

          I think we'll need to define groups of shards (maybe this is
          already in the spec), and allow a configurable failure setting
          per group. For example, group "live" would be allowed to return
          partial results because the user always wants results returned
          quickly. Group "archive" would always return complete results
          (if a node is down it can be configured to retry the request N
          times until it succeeds under a given max timeout).

          Also a request could be addressed to a group of shards, which
          would allow one set of replicated Zookeeper servers for N Solr
          clusters (instead of a Zookeeper server per Solr cluster).

          How are we addressing a failed connection to a slave server, and
          instead of failing the request, re-making the request to an
          adjacent slave?

          Mark Miller added a comment -

          Seems like we need to handle these types of failures anyway

           Right, but I was seeing them as two types of failures, possibly. One, the node is really gone - it's not coming back, or it's not coming back for many minutes. Two, GC took 8 seconds and the timeout is 5 seconds.

           The way things work now, if someone searched during the GC, they'd get all the results back, the search would just take longer. They'd see the hourglass spinning, know the results were slow for this search, but still coming. I was/am not sure if we wanted to replicate that.

           One option obviously is to treat a 15 second timeout the same as if the node went down. It seems to depend though - in a lot of cases, if I have large GCs often enough, I'd prefer slower search at those moments over users seeing daily failures / partial results. That's the behavior you currently get without ZooKeeper.

          It just depends on whether we treat pauses slightly over the default timeout as outages. I can see that making sense in many cases, but not in others, depending on who is using the system and how much redundancy they have setup.

          Yonik Seeley added a comment -

           Right - that's the problem I want to address. Ephemeral nodes go away when the client times out - with a low timeout, you can learn relatively fast that a node is down.

          My assumption was to use a longer timeout on zookeeper (the default seems fine) to define who was active.

          When a node makes a request to a node that is down, it will fail relatively quickly, and can use a local policy to avoid that node for a certain amount of time. Seems like we need to handle these types of failures anyway, regardless of how low we set the zookeeper timeout.

          Patrick Hunt added a comment -

           You guys are asking the right questions. In particular the issue of "how expensive is it to lose a Solr node" is a good one to think about. Unfortunately I don't know enough about Solr to advise you, but if it's not very expensive to lose/regain a node then just let it time out. The rest of the system will see this quickly (via ephemeral node/watch) and when the Solr node is active again (comes out of the GC pause) it will talk to the ZK server, see that its session has been expired, and re-bootstrap into the Solr "cloud".

           Another thing to ask yourself is this: "if a Solr node pauses for 4 minutes due to a GC pause, how different is that from a network partition or crash/reboot of that node?" What I'm saying here is, the node is gone for 4 minutes - what effect does that have on the rest of your system? Say you are expecting some very low SLA from that node; then upping the timeout is not useful here. Loss of the Solr node due to GC is no different from a network partition or crash/reboot of the host.

          Yonik Seeley added a comment -

          I suppose what I am worried about is when you don't have duplicate shards - or when two shards with the same data have a long gc pause together - if they just drop out, you get results back that are not from the full index.

           Ahhh, good point. We can't let that happen. But if NodeA said it had ShardX, and then its ephemeral node went away, it's not dropping out of the cluster... it's just that it's currently unavailable (and we need to return partial results or fail the request).

          Mark Miller added a comment -

          so I guess for that brief period, we drop out of other distrib requests, and if we get hit, we just use the old shards list for requests that hit the dropped server?

           I suppose what I am worried about is when you don't have duplicate shards - or when two shards with the same data have a long GC pause together - if they just drop out, you get results back that are not from the full index. Many would prefer the search just take a bit longer (as it normally would with a GC) rather than losing results.

          Mark Miller added a comment -

          From our experience with hbase (which is the only place we've seen this issue so far, at least to this extent) you need to think about:

          1) client timeout value tradeoffs
          2) effects of session expiration due to gc pause, potential ways to mitigate

           for 1) there is a tradeoff (the good thing is that not all clients need to use the same timeout, so you can tune based on the client type; you can even have multiple sessions for a single client, each with its own timeout). You can set the timeout higher, so if your ZK client pauses you don't get expired; however, this also means that if your client crashes the session won't be expired until the timeout expires. This means that the rest of your system will not be notified of the change (say you are doing leader election) for longer than you might like.

          for 2) you need to think about the potential failure cases and their effects. a) Say your ZK client (solr component X) fails (the host crashes), do you need to know about this in 5 seconds, or 30sec? b) Say the host is network partitioned due to a burp in the network that lasts 5 seconds, is this ok, or does the rest of the solr system need to know about this? c) Say component X gc pauses for 4 minutes, do you want the rest of the system to react immed, or consider this "ok" and just wait around for a while for X to come back.... but keep in mind that from the perspective of "the rest of your system" you don't know the difference between a) or b or c (etc...), from their viewpoint X is gone and they don't know why (unless it eventually comes back)

          In hbase case session expiration is expensive as the region server master will reallocate the table (or some such). In your case the effects of X going down may not be very expensive. If this is the case then having a low(er) session timeout for X may not be a problem. (just deal with the session timeout when it does happen, X will eventually come back)

           If X recovery is expensive you may want to set the timeout very high, but as I said this makes the system less responsive if X has a real problem. Another option we explored with HBase is to use a "lease" recipe instead. Set a very high timeout, but have X update the znode (still ephemeral) every N seconds. If the rest of the system (whoever is interested in X status) doesn't see an update from X in T seconds, then perhaps you log a warning ("where is X?"). Say you don't see an update from X in T*2 seconds, then page the operator: "warning, maybe problems with X". Say you don't see one in T*3 seconds (perhaps this is the timeout you use, in which case the znode is removed), consider X down, clean up and enact recovery. These are made-up actions/times, but you can see what I'm getting at. With a lease it's not "all or nothing". You (Solr) have the option to take actions based on the lease time, rather than just the znode being deleted in the typical case (all or nothing). The tradeoff here is that it's a bit more complicated for you - you need to implement the lease rather than just relying on the znode being deleted - you would of course set a watch on the znode to get notified when the znode is removed (etc.)

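           A minimal sketch of that lease recipe, assuming a high session timeout, an update period N of 5 seconds, and znode paths chosen only for illustration; it is not the approach of the attached patches, and it also assumes the observer's clock and the ZooKeeper server clock are roughly in sync, since it compares the znode mtime against local time.

           import java.util.concurrent.Executors;
           import java.util.concurrent.ScheduledExecutorService;
           import java.util.concurrent.TimeUnit;
           import org.apache.zookeeper.CreateMode;
           import org.apache.zookeeper.ZooDefs;
           import org.apache.zookeeper.ZooKeeper;
           import org.apache.zookeeper.data.Stat;

           // Lease-style heartbeat: the node refreshes its (still ephemeral) znode every
           // N seconds, and observers grade staleness instead of relying only on deletion.
           public class LeaseHeartbeat {
             private static final long PERIOD_SEC = 5;            // the "N" above
             private final ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();

             /** Writer side: create the znode (parent must exist) and keep refreshing it. */
             public void start(ZooKeeper zk, String path) throws Exception {
               zk.create(path, ping(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
               exec.scheduleAtFixedRate(() -> {
                 try {
                   zk.setData(path, ping(), -1);                  // refresh the lease
                 } catch (Exception e) {
                   // log and keep trying; the (high) session timeout is the backstop
                 }
               }, PERIOD_SEC, PERIOD_SEC, TimeUnit.SECONDS);
             }

             /** Observer side: grade how stale the last heartbeat is (T, T*2, T*3). */
             public static String grade(ZooKeeper zk, String path, long tMillis) throws Exception {
               Stat stat = zk.exists(path, false);
               if (stat == null) return "DOWN";                   // znode removed: session expired
               long age = System.currentTimeMillis() - stat.getMtime();
               if (age > 3 * tMillis) return "DOWN";              // treat as failed, enact recovery
               if (age > 2 * tMillis) return "PAGE_OPERATOR";
               if (age > tMillis)     return "WARN";
               return "OK";
             }

             private static byte[] ping() {
               return Long.toString(System.currentTimeMillis()).getBytes();
             }
           }

           The session timeout stays as the hard backstop; the periodic setData is what gives observers the finer-grained "where is X?" signal described above.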
          Mark Miller added a comment -

          Not sure I understand... for group membership, I had assumed there would be an ephemeral znode per node. Zookeeper does pings, and deletes the znode when the session expires, but those aren't "updates" per se.

           Right - that's the problem I want to address. Ephemeral nodes go away when the client times out - with a low timeout, you can learn relatively fast that a node is down. But because we may have long GC pauses, a low timeout will cause false "down" reports. And we have to handle reconnections. But if we raise the timeout to get around these GC pauses, if there really is a problem, it will take a long time to learn about it. One of the recommendations above was to use a lease system instead, where each node does these updates. I'm trying to determine which strategy we actually want to use. Another option given was to let the GC cause a timeout, and then reconnect - but Solr has to "wait" for the reconnection to occur before it can access ZooKeeper again.

          Zookeeper client->server timeouts? Or Solr node->node request timeouts?
          Zookeeper timeouts need to be handled on a per-case basis - we should design such that most of the time we can continue operating even if we can't talk to zookeeper.

          Zookeeper client->server timeouts

          But as you say above, if a client times out, its ephemeral node goes down, and that shard will no longer be participating in distrib requests hitting other servers (presumably). How can we continue operating? We won't know which shards to hit (I guess we could use the "old" shards list?) and we won't be part of distributed requests from other shards, because our ephemeral node will be removed ...

           I'm referring to Patrick Hunt's comments above. Perhaps, because recovery won't be expensive, that's what we want to do - but Solr won't be able to access ZooKeeper until it's recovered - so I guess for that brief period, we drop out of other distrib requests, and if we get hit, we just use the old shards list for requests that hit the dropped server?

          Yonik Seeley added a comment -

          I don't necessarily like the idea of all of the nodes updating all the time to note their existence, but it seems like our best option from what I gather now.

          Not sure I understand... for group membership, I had assumed there would be an ephemeral znode per node. Zookeeper does pings, and deletes the znode when the session expires, but those aren't "updates" per se.

          My main concern at the moment is coming up with a plan for these timeouts though.

          Zookeeper client->server timeouts? Or Solr node->node request timeouts?
          Zookeeper timeouts need to be handled on a per-case basis - we should design such that most of the time we can continue operating even if we can't talk to zookeeper.

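           For reference, group membership with one ephemeral znode per node looks roughly like this with the ZooKeeper Java API; the /solr/live_nodes path and the empty payload are assumptions for the example, not the layout this issue settled on.

           import java.util.List;
           import org.apache.zookeeper.CreateMode;
           import org.apache.zookeeper.ZooDefs;
           import org.apache.zookeeper.ZooKeeper;

           // One ephemeral znode per node under a shared parent (the parent is assumed
           // to have been created at bootstrap). ZooKeeper's own pings keep the session,
           // and therefore the znode, alive; no application-level updates are needed.
           public class LiveNodes {
             public static void register(ZooKeeper zk, String nodeName) throws Exception {
               zk.create("/solr/live_nodes/" + nodeName, new byte[0],
                         ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
             }

             public static List<String> current(ZooKeeper zk) throws Exception {
               // watch=true: the client's default Watcher is notified when membership changes
               return zk.getChildren("/solr/live_nodes", true);
             }
           }

           The znode disappears only when the session expires, which is exactly the behavior whose timeout tradeoffs are being debated in this thread.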
          Mark Miller added a comment - - edited

           Yeah, I'm not trying to tackle node selection yet - just client timeouts. But if a client is going to be periodically updating a node to state it's still in good shape, it seems like it might as well make the update include its current load. Not that that's not something that can't be easily added later - I mostly threw that in because it was part of the previous recommendation on how to handle client timeouts.

           I don't necessarily like the idea of all of the nodes updating all the time to note their existence, but it seems like our best option from what I gather now. Otherwise, nodes will be timing out all the time - and handling the reconnection seems like a pain - if Solr needs something from ZooKeeper after a GC ends, it's going to have to pause and wait for the reconnect. Or I guess, on every ZooKeeper request, build in a timed retry?

           My main concern at the moment is coming up with a plan for these timeouts though. If we raise the timeout limits, we need another method for determining nodes are down.

           I suppose another option might be that it's up to a node that can't reach another node to tag it as unresponsive?

          Yonik Seeley added a comment -

          While our designs shouldn't preclude load based node selection, I don't think we should tackle it now - it's fraught with peril.

          We should allow the configuration of "capacity" for a node (or host?) and eventually implement a load balancing mechanism that takes such capacity into account. If one node has half the capacity of another, it will be sent half the number of requests. This type of static balancing is easier to predict and test.

          The other issue with updating statistics is the write cost on zookeeper - we may not want to do it by default, and if we do, we wouldn't want to do it with a high frequency.

          Some other considerations when choosing nodes for distributed search:

          • the same node should be used for a particular shard for the multiple phases of a distributed search, both for better consistency between phases, and better caching.
          • zookeeper could be used to take a node out of service (and other nodes should immediately stop making requests to that node), but each node also needs to be able to determine failure of another node and retry a different node independent of zookeeper.

          Everything (search traffic) should work when disconnected from zookeeper, based on the last cluster configuration seen.

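           A small sketch of the static, capacity-weighted selection described above, with an illustrative Replica type; in practice the capacities would come from cluster configuration (for example stored in ZooKeeper) rather than being hard-coded, and the class names here are not from any patch.

           import java.util.List;
           import java.util.Random;

           // Pick one replica of a shard with probability proportional to its configured
           // capacity. Capacities are assumed to be positive integers.
           public class CapacityBalancer {
             public static class Replica {
               final String url;
               final int capacity;          // relative weight, e.g. 1 or 2
               public Replica(String url, int capacity) { this.url = url; this.capacity = capacity; }
             }

             private final Random random = new Random();

             public Replica pick(List<Replica> replicas) {
               int total = 0;
               for (Replica r : replicas) total += r.capacity;
               int target = random.nextInt(total);
               for (Replica r : replicas) {
                 target -= r.capacity;
                 if (target < 0) return r;
               }
               return replicas.get(replicas.size() - 1);   // not reached when capacities > 0
             }
           }

           With capacities 1 and 2, the smaller node receives on average half as many requests as the larger one, matching the intent above; retrying a failed request on a different replica would sit on top of this, independent of ZooKeeper.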
          Mark Miller made changes -
          Link This issue is related to SOLR-1663 [ SOLR-1663 ]
          Mark Miller added a comment -

          I wonder how we might track load -

          Currently, wouldn't we have to grab every request handler and add up the requests and track the change in a given period of time?

          Would it make sense to add total requests received tracking (across handlers), so we don't have to keep polling each/every request handler?

          Mark Miller added a comment -

          So based on what we know, it sounds like we are going to have to use a very high timeout for the ZooKeeper client?

          Then each node will run a thread that periodically updates its availability? When a node chooses its shards for a distributed search, it can look at how long it's been since each shard updated itself, and choose or drop based on that? In the event that a very long timeout period has passed, the client will time out and the znode will actually be removed?

          This seems like it will be easier than trying to reconnect after timeouts and managing Solr during the "disconnected" period?

          Sounds like the update itself might be the current load on that node - then nodes choosing other nodes for a distrib search can use both how recently nodes were updated as well as their reported loads to choose which nodes to select for a search?

          Does this sound right?
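
          A minimal sketch of that periodic-update idea, assuming the plain ZooKeeper Java client; the znode path, the 10-second interval, and the load metric are made-up examples rather than an agreed design:

          import java.util.concurrent.Executors;
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.TimeUnit;
          import org.apache.zookeeper.ZooKeeper;

          class NodeHeartbeat {
              private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

              void start(final ZooKeeper zk, final String nodePath) {
                  scheduler.scheduleAtFixedRate(new Runnable() {
                      public void run() {
                          try {
                              // Writing the current load also refreshes the znode's mtime,
                              // which peers can read to judge how stale this node is.
                              String load = String.valueOf(currentLoad());
                              zk.setData(nodePath, load.getBytes("UTF-8"), -1);
                          } catch (Exception e) {
                              // A missed beat is tolerable; peers decide liveness from staleness.
                          }
                      }
                  }, 0, 10, TimeUnit.SECONDS);
              }

              private double currentLoad() {
                  return 0.0; // placeholder, e.g. recent requests per second
              }
          }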

          Yonik Seeley added a comment -

          I think I'll try to start implementing some bootstrap code - it will make it simple for new users to get a cluster up and running, in addition to making further development easier. I'm thinking of enabling it via -Dbootstrap_collection, which will do everything necessary, including copying local files to ZK.
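
          A sketch of what such bootstrap code might do, using the plain ZooKeeper client API; the /configs/<name> layout mirrors the paths seen in the test output elsewhere in this issue but is only an assumption here:

          import java.io.File;
          import java.nio.file.Files;
          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.ZooDefs;
          import org.apache.zookeeper.ZooKeeper;

          class ConfigBootstrap {
              // Upload every file in a local conf directory as a child znode of /configs/<configName>.
              void upload(ZooKeeper zk, String configName, File confDir) throws Exception {
                  String base = "/configs/" + configName;
                  ensureExists(zk, "/configs");
                  ensureExists(zk, base);
                  File[] files = confDir.listFiles();
                  if (files == null) return;
                  for (File f : files) {
                      if (f.isFile()) {
                          byte[] data = Files.readAllBytes(f.toPath());
                          zk.create(base + "/" + f.getName(), data,
                                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                      }
                  }
              }

              private void ensureExists(ZooKeeper zk, String path) throws Exception {
                  if (zk.exists(path, false) == null) {
                      zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                  }
              }
          }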

          Mark Miller added a comment -

          Regarding connection to ZooKeeper, this example shows one way to wait for the connection to be established using a re-usable "CountdownWatcher"

          Thanks! I've popped this in for now. I had introduced a simple similar mechanism to wait, but hadn't added in a timeout. This gives us a good start.

          Patrick Hunt added a comment -

          Regarding connection to ZooKeeper, this example shows one way to wait for the connection to be established using a re-usable "CountdownWatcher":

          http://github.com/phunt/zkexamples/blob/master/src/test_session_expiration/TestSessionExpiration.java

          You might also want to review the FAQ for more insight into ZK state transitions and their effects:
          http://wiki.apache.org/hadoop/ZooKeeper/FAQ

          Feel free to open a dialog on zookeeper-user if you have questions.
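
          For reference, the latch-based idea looks roughly like the sketch below (written in the spirit of the linked example, not a copy of it, and not Solr code). A latch sidesteps the lost-wakeup race, because a countDown() that happens before await() is still observed:

          import java.util.concurrent.CountDownLatch;
          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.TimeoutException;
          import org.apache.zookeeper.WatchedEvent;
          import org.apache.zookeeper.Watcher;
          import org.apache.zookeeper.ZooKeeper;

          class ConnectionWatcher implements Watcher {
              private final CountDownLatch connected = new CountDownLatch(1);

              public void process(WatchedEvent event) {
                  if (event.getState() == Event.KeeperState.SyncConnected) {
                      connected.countDown();
                  }
              }

              // Blocks until the session reaches SyncConnected or the wait times out.
              ZooKeeper connect(String hosts, int sessionTimeoutMs, long waitMs) throws Exception {
                  ZooKeeper zk = new ZooKeeper(hosts, sessionTimeoutMs, this);
                  if (!connected.await(waitMs, TimeUnit.MILLISECONDS)) {
                      zk.close();
                      throw new TimeoutException("Could not connect to ZooKeeper within " + waitMs + " ms");
                  }
                  return zk;
              }
          }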

          Noble Paul added a comment -

          why should the ZookeeperController reference be kept in SolrCore also? Can it not be fetched just in time from CoreContainer?

          Mark Miller added a comment -

          I started hacking in a wait rather than a pause on client connections. It's not a complete and thorough solution, but it starts us down that path.

          Interesting that both of us only see the issue on Windows and not Linux - your Linux box must be really fast too.

          Mark Miller added a comment - - edited

          Ok, continued being a rude host and looked right now

          Yeah, the ref in the core is currently just a convenience. The client belongs to the CoreContainer though. Most of this code is pretty exploratory. That's why I put the //nocommit above the getZooKeeper method on the core - that all has to be considered. Of course you can just use the CoreDescriptor to get the CoreContainer, and then get the ZooKeeper component. But just for ease right now, the core is holding its own ref - not its own client though.

          edit

          Though at a minimum, it doesn't make sense to keep the overloaded constructors as-is - even if the core is to keep its own ref (which is not necessary, it just saves a couple of method calls), it would make more sense to pull it off coreDescriptor.getCoreContainer(). I just patched that all in quickly at the beginning - currently nothing even uses it. I just put it there quickly when I was considering how components and handlers would access the ZooKeeper component.

          Mark Miller added a comment -

          Mark - I notice that you're currently using a zookeeper client per core...

          Hmm ... I shouldn't be. I'll have to look at the code, but I should be making one ZooKeeper component class that holds the ZooKeeper client and then giving the same instance to each core? So each core has a ref to that instance, but it should be the same? I'll have to look at the code again later ...

          Henry Robinson added a comment -

          Yonik -

          You're right, the way to correctly connect is to have a condition variable that can get notified by a watcher fired when the connection to ZK is established. You will also of course need to worry about what happens if the connection is established before you can issue the wait on the condition variable, and you lose the wake-up. We should maybe think about adding a synchronous connection API to ZooKeeper...

          Henry

          Yonik Seeley added a comment -

          Mark - I notice that you're currently using a zookeeper client per core... was this just easiest to start with, or were there other advantages?
          A zk client is another connection, and if people run a lot of cores (like AOL does?) then that's a lot of network connections and a lot of ZK pings.
          Also, we need one ZK client associated with the core container and not any specific core anyway.

          Yonik Seeley added a comment -

          OK, I fixed it for now (rather hacked around it).
          The ZooKeeper client startup is asynchronous - if you use it quickly enough after creation, you can get a ConnectionLoss (really, it never finished establishing the connection, I believe). For now I've just added a sleep of 200ms (100 often wasn't enough). I suppose we should switch to using a Watcher and some sort of wait mechanism when we really need to do a ZooKeeper operation synchronously.

          It does bring up the point that we need to think about what we can do when disconnected from ZK too... some things will need to wait for ZK, some things we can remember and update in ZK when we become connected again, and some things we could simply drop.

          Mark Miller added a comment -

          It looks to me like the ZooKeeper server is shutting down before the test is finished. Will have to get Windows up and running to see why that might be.

          Yonik Seeley added a comment -

          Update: seems to pass for me on Linux, but it's failing much of the time on Win7.
          It's not consistent - it probably passes 20% of the time, but often fails in the middle.

          Yonik Seeley added a comment -

          Been trying to get the cloud branch going... but I'm running into issues somewhere (BasicZooKeeperTest):

           "log4j:WARN No appenders could be found for logger (org.apache.zookeeper.server.ZooKeeperServerMain).
          log4j:WARN Please initialize the log4j system properly.
          make:/solr
          make:/collections/collection1/config=collection1
          put:/configs/collection1/solrconfig.xml solr\conf\solrconfig.xml
          make:/configs/collection1/solrconfig.xml
          put:/configs/collection1/schema.xml solr\conf\schema.xml
          make:/configs/collection1/schema.xml
          put:/configs/collection1/stopwords.txt solr\conf\stopwords.txt
          make:/configs/collection1/stopwords.txt
          put:/configs/collection1/protwords.txt solr\conf\protwords.txt
          make:/configs/collection1/protwords.txt
          put:/configs/collection1/mapping-ISOLatin1Accent.txt solr\conf\mapping-ISOLatin1Accent.txt
          make:/configs/collection1/mapping-ISOLatin1Accent.txt
          put:/configs/collection1/old_synonyms.txt solr\conf\old_synonyms.txt
          make:/configs/collection1/old_synonyms.txt
          Dec 12, 2009 11:17:42 AM org.apache.solr.AbstractZooKeeperTestCase setUp
          INFO: ####SETUP_START testBasic
          look for collection config:/collections/collection1
          
          java.lang.RuntimeException: org.apache.solr.common.SolrException: ZooKeeper Exception
          	at org.apache.solr.util.TestHarness.<init>(TestHarness.java:152)
          	at org.apache.solr.AbstractZooKeeperTestCase.setUp(AbstractZooKeeperTestCase.java:81)
          	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:40)
          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)
          Caused by: org.apache.solr.common.SolrException: ZooKeeper Exception
          	at org.apache.solr.core.ZooKeeperController.loadConfigPath(ZooKeeperController.java:103)
          	at org.apache.solr.core.ZooKeeperController.<init>(ZooKeeperController.java:48)
          	at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:132)
          	at org.apache.solr.util.TestHarness.<init>(TestHarness.java:139)
          	... 19 more
          Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/collection1
          	at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
          	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
          	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1214)
          	at org.apache.solr.core.ZooKeeperController.loadConfigPath(ZooKeeperController.java:91)
          	... 22 more
          
          Noble Paul added a comment - - edited

          A few comments on the http://wiki.apache.org/solr/SolrCloud#Layout

          When we talk about the ZooKeeper schema, we have to add more detail about the contents, such as:

          • What is the node type ? (ephemeral|sequential etc)
          • What is the node data?

          What does the config look like? mycluster/configs/collection1_config/v1/
          What are these? solrconfig.xml, schema.xml, stopwords.txt, etc

          • are they nodes? if yes, of what type?
          • What is the data?

          contents of
          /collections/collection1/shards

          localhost:8983/solr/collection1=shardX,shardY,shardZ
          localhost:7574/solr/collection1=shardX,shardZ

          What are these? Nodes? What is the data?

          Moreover, we need to document how the configuration in solr.xml/solrconfig.xml will look, so that we know which component consumes what information.

          Patrick Hunt added a comment -

          I'm not familiar with Solr's requirements but, at a higher level, I wanted to point out that when designing your ZooKeeper model you should keep scaling issues in mind; identifying "patterns" is also very useful. See this link for some background (discussions we are having with HBase in a similar vein):
          http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases

          The basic concerns users should think about:

          1) number of sessions (clients) the ZK service will maintain (for solr I think this is small, 10's or 100's of sessions right?)

          2) number of znodes and size of znodes.
          a) Memory requirements mainly, and also GC effects, as current Sun VMs are poor w.r.t. GC pauses
          b) we generally discourage large data sizes on znodes (generally we suggest < 10k, < 1k even better) as the ZK service copies this data from server->leader->followers as part of a write - so large data can slow your write performance as a result
          c) a second issue re: data size on znodes - we don't have partial read/write operations, so you want to partition your data into multiple znodes rather than having one large znode

          3) number of watches - typically you'll be using watches to dynamically update Solr based on changes to the system. You want to think about the watches you are setting (in particular, you would like to limit the "herd" effect).

          Yonik Seeley added a comment -

          Nice work Mark! I'll try and get what you have up and running.

          So then the idea would: a user sets up everything in the model (we need good tools for this if thats the case), then the system builds the state automatically? When a search request comes in, we grab which shards to hit, cache them, and use them until a Watch event tells us to look again?

          Yep... but there are race conditions, so our request to each node should specify which shard it is querying on that node. The node needs to notify us if it no longer has that shard.

          How about how a host registers itself?

          Seems simple, but I ended up retyping and deleting my answer to you 3 times. Some complicating factors:

          • a machine may have multiple network interfaces
          • a network interface may have multiple IP addresses (and either IPv4 or IPv6)
          • there may be multiple NIS/DNS entries for an IP
          • there may be multiple virtual machines on a single physical box

          I do think that a node should be able to register itself though, and that a user should be able to override that.
          We could perhaps start off with just identifying a node by the IP address + port the servlet container is bound to (or if multiple, just the first IP?) and model physical_box like other topology items... rack, switch, datacenter, etc.
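
          A sketch of that fallback, under the stated caveats about multi-homed machines; the "solr.host" override property is a made-up name for illustration, not an existing Solr setting:

          import java.net.InetAddress;

          class NodeName {
              // Default node identity: an explicit override wins, otherwise one of the
              // local addresses (getLocalHost() picks only one on a multi-homed box).
              static String defaultNodeName(int port) throws Exception {
                  String host = System.getProperty("solr.host");
                  if (host == null || host.length() == 0) {
                      host = InetAddress.getLocalHost().getHostAddress();
                  }
                  return host + ":" + port;
              }
          }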

          Mark Miller added a comment -

          I've committed some base code to the cloud branch.

          This will make it easy for anyone to play with how things work. I haven't put in anything that's difficult to shift around. Basic ZooKeeper support, basic config loading from ZooKeeper, and some basic test scaffolding for ZooKeeper. I'm also playing around with pulling nodes for distrib search from ZooKeeper, but I'm going to wait to put any of that in until we have a little more nailed down.

          Let's get the discussion going and push this forward.

          Mark Miller added a comment -

          Missed the latest comment - I'm going to commit this base to the branch. I've been playing around with further details, but I really think we need to start nailing things down a little more - not so that we can't pull the nails up, but so that we can start pushing forward.

          Has anyone looked at Yonik's SolrCloud wiki page? I know some have - any comments?

          I'd like to be more clear on the layout.

          And how is the model/state distinction going to work? Two paths that are essentially the same - model/state, both with the same layout? Then you set up what you'd like in the model, but the state reports what's actually going on? E.g. the model will list all of the hosts, but the state may show that only n of them are available? I know that's the idea, but what's the impl?

          How about how a host registers itself? It would be nice to do this automatically, but as Yonik mentions in the wiki, a host can have multiple names. Do we just require that a user uses the name that we choose? Do we make it so that any name the user tries works (by enumerating every hostname)? Do we allow the user to manually override what to use as the host address?

          So then the idea would be: a user sets up everything in the model (we need good tools for this if that's the case), then the system builds the state automatically? When a search request comes in, we grab which shards to hit, cache them, and use them until a Watch event tells us to look again? I know that's starting to get a little low level, but I'm just trying to spark some forward-momentum discussion.

          Any opinions on going shards to nodes or nodes to shards in the layout?

          Yonik Seeley added a comment -

          I just created a branch: http://svn.apache.org/viewvc/lucene/solr/branches/cloud/

          I like the direction you've been going Mark, do you think it's ready to do a check-in on the branch?

          Yonik Seeley added a comment -

          We should probably start a ZooKeeper branch since this issue is likely to get quite large and hopefully have many contributors

          +1, that will help both direct developers and power users who want to try it out (and thus lower the bar for small contributions)

          Mark Miller made changes -
          Attachment SOLR-1277.patch [ 12427253 ]
          Mark Miller added a comment -

          Inching forward as we try and nail down the layout.

          • moves the configs to /solr/configs/collection1 in the tests
          • which config to load is discovered from
            /solr/collections/collection1/config=collection1
          • system property for the name of the collection to work with
          • consolidated ZooKeeper host and Solr path sys properties into one, i.e. localhost:2181/solr

          I still expect everything in this patch to be very fluid and change as we move forward - but it's something to give us a base to play with.

          We should probably start a ZooKeeper branch since this issue is likely to get quite large and hopefully have many contributors - that model has worked quite well with the flexible indexing issue in Lucene, and I have gotten quite handy at quick merging from my practice there

          Noble Paul made changes -
          Link This issue depends upon SOLR-1621 [ SOLR-1621 ]
          Noble Paul added a comment -

          We are hijacking the original issue here. I have opened SOLR-1621 for this.

          Noble Paul added a comment -

          We could sub a name that acts in that manner - one thats not likely to be out there currently like default_solr_core or something.

          Yep. Just the way Tomcat keeps the default context as ROOT, we can keep something like DEFAULT_CORE.

          Noble Paul added a comment -

          That page looks a little low level - seems like it describes the design of a specific implementation.

          Without the use cases it is hard to decide upon the features of the ZookeeperComponent. To me, the most common use cases are distributed search and master/slave setup. This at least spells out what is expected of whom.

          Otis Gospodnetic added a comment -

          How about this idea for the "what to do with the default core name".
          What if the default/empty-named core always pointed to the Solr admin/dashboard page, something that shows all the info about the system (pulled from ZK)?

          Mark Miller added a comment -

          but worry about the details of managing it with core admin with a name of ""

          We could sub a name that acts in that manner - one that's not likely to be out there currently, like default_solr_core or something.

          Mark Miller added a comment -

          I'd be happier if we went down the path of a solr/cores/xxx structure instead, and using solr.xml as an over ride like servlet container's use webapps and context.xml files

          That's an appealing idea. You could use defaults, and just get the core name from the discovered instance dir. If you want to change the defaults, you can add the solr.xml. This would also work well in the ZooKeeper case, as cores could be discovered by default in the same manner. It removes the need for solr.xml in multicore and still allows for the singlecore/multicore merge in the manner explored above.

          I don't think the back compat part of it would be very difficult at all either.
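
          A minimal sketch of that discovery-by-directory idea; the solr/cores/<name>/conf layout is an assumption here, not how Solr currently finds cores:

          import java.io.File;
          import java.util.ArrayList;
          import java.util.List;

          class CoreDiscovery {
              // Treat each subdirectory that contains a conf/ directory as a core named after
              // the directory; solr.xml (if present) would override these discovered defaults.
              static List<String> discoverCores(File coresDir) {
                  List<String> names = new ArrayList<String>();
                  File[] children = coresDir.listFiles();
                  if (children != null) {
                      for (File child : children) {
                          if (new File(child, "conf").isDirectory()) {
                              names.add(child.getName());
                          }
                      }
                  }
                  return names;
              }
          }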

          Yonik Seeley added a comment -

          What I'm suggesting here is a little different though - that Solr.xml allows a core config that acts like single core now - a core name of "" and an instance dir of "" or "." that sets everything to work as single core now.

          I like the idea - but worry about the details of managing it with core admin with a name of "" (I think the SolrParams class that handles HTTP params treats zero length params as a null - as if it hasn't been passed. I believe this was done to support forms).

          but if you want to use zookeeper with singlecore, add the solr.xml and setup the required config

          My first thought was that it should be possible to store pretty much everything in zookeeper... but then again - the nodes (solr servers) contain the actual indexes, so we should probably persist information about that at least.

          Still seems like it would be nice to start a solr server and point it at a cluster, w/o the need to write a solr.xml file for it. Solr could then write out a solr.xml with the needed info. This would be useful for adding a new server to a cluster.

          Mark Miller added a comment -

          Yep, me too. That doesn't necessarily make solr.xml required though.

          Right, as I said above, I don't think so either - we could have a dummy simple solr.xml that would be used if you don't want to customize anything (pull from the jar or a String, or whatever) - that would allow for the same code paths in either case.

          It would also be very nice if we could avoid breaking all of the existing URLs by having a core that could also act as a default core or something. IIRC this was discussed in the past - not sure what the outcome was though.

          That's essentially the change I'm talking about - though I think in a different way. You can do that now by just adding an alias for a core with a name of "". We don't do it out of the box now - don't remember exactly why either, but I think Hoss ended up putting the kibosh on it for various reasons.

          What I'm suggesting here is a little different though - that Solr.xml allows a core config that acts like single core now - a core name of "" and an instance dir of "" or "." that sets everything to work as single core now. That simple config would be the "dummy" Solr.xml that we load when we don't find one. It also lets a multicore user set up whatever cores he wants, plus still have a core that acts as single core does now in terms of URLs, config. It's a fairly simple change too.

          Then we can say, okay, you can use single core with no solr.xml, but if you want to use zookeeper with singlecore, add the solr.xml and setup the required config - but you don't have to switch to the multicore structure - it would be a seamless change.

          Yonik Seeley added a comment -

          The wiki page is my attempt to capture it. I wish to have a debate around it and finally reach a consensus on the design.

          That page looks a little low level - seems like it describes the design of a specific implementation.
          I also want to brainstorm about requirements, high-level use-cases, and zookeeper layout/schemas.
          I'm not sure if everyone is on the same page enough to attempt editing the same wiki page.

          Yonik Seeley added a comment -

          I'd really love to see the single core stuff get subsumed under the multicore - a lot cleaner than the multiple code paths, especially as zookeeper comes into play.

          Yep, me too. That doesn't necessarily make solr.xml required though.
          It would also be very nice if we could avoid breaking all of the existing URLs by having a core that could also act as a default core or something. IIRC this was discussed in the past - not sure what the outcome was though.

          Mark Miller added a comment - - edited

          less required configuration please.

          You wouldn't have to configure anything. Ideally, the example would come with it simply configured for one core, with everything how it works now (same URLs, etc). Optionally, we could allow someone to not use it, and just assume that simple configuration. The example would want it though, so we can put in commented-out example config (e.g. for ZooKeeper setup). Then if you didn't have it, we would essentially read a dummy simple cfg (though I don't think that's really necessary myself), and we can have singlecore/multicore more in line with each other.

          There is no getting around the config we are looking for - zookeeper will need it one way or another, and not in solrconfig.

          edit

          Of course, you could say using multicore is required for ZooKeeper - but it's an odd, artificial limitation. Having multicore subsume single core in a more seamless way is a nice goal in either case.

          Mark Miller added a comment -

          Why don't we open an issue and track this down.

          Is there an issue out there already that we can piggy back on? I seem to remember the idea of just moving to multicore coming up before - did we not get to an issue for it though?

          I'd really love to see the single core stuff get subsumed under the multicore - a lot cleaner than the multiple code paths, especially as ZooKeeper comes into play.

          patrick o'leary added a comment - - edited

          I've been thinking about the idea of perhaps "deprecating" the old single core mode and requiring a solr.xml

          -1, less required configuration please. I'd be happier if we went down the path of a solr/cores/xxx structure instead, and using solr.xml as an override, like servlet containers use webapps and context.xml files.

          But that's for another issue

          Noble Paul added a comment -

          I've been thinking about the idea of perhaps "deprecating" the old single core mode and requiring a solr.xml

          +1. Why don't we open an issue and track this down.

          and getting a good model in ZK seems really important

          The wiki page is my attempt to capture it. I wish to have a debate around it and finally reach a consensus on the design.

          Yonik Seeley added a comment -

          I have time to start looking into this zookeeper stuff now... it's great to see the number of people already involved! Hopefully we can capture everyone's requirements, use cases, and expertise to give us a really solid starting point.

          While this may not be functionality we need in an initial release, it's important to ensure our initial design does not limit future functionality.

          +1, and getting a good model in ZK seems really important. I have a feeling we'll be building off of it for years to come.

          Solr 1.5: The Cloud Edition!

          Mark Harwood added a comment -

          Not intimately familiar with Solr but just thought I'd add some comments based on my experiences building a Zookeeper-managed Lucene cluster.

          • Zookeeper can be used to hold the definitive config plan for a cluster (how many logical indexes, numbers of shards, replicas etc). "Dumb" search servers can watch and respond to centralised config changes in Zookeeper to assume roles.
          • An admin console can be used to:
            a) Change and publish a desired config plan to Zookeeper
            b) Monitor the implementation of a config plan (which servers are active, which server is the current shard master, what data version servers hold etc)

          Overall this works well, but what I think was a step too far was trying to use ZooKeeper to coordinate distributed transactions across a cluster (writers synching commits, all readers consistently synched at the same version).
          This transaction management is a complex beast, and when you encounter issues like the rogue GC mentioned earlier, things start to fall apart quickly. As far as the CAP theorem goes (Consistency, Availability, Partition tolerance - pick 2), I'm definitely favouring Availability over Consistency when managing a partitioned system.

          Cheers,
          Mark
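
          The watch-and-respond pattern described above, sketched against the plain ZooKeeper API; the znode path and the applyPlan() callback are illustrative assumptions, not code from that cluster or from Solr:

          import org.apache.zookeeper.WatchedEvent;
          import org.apache.zookeeper.Watcher;
          import org.apache.zookeeper.ZooKeeper;
          import org.apache.zookeeper.data.Stat;

          class ConfigPlanWatcher implements Watcher {
              private final ZooKeeper zk;
              private final String path;

              ConfigPlanWatcher(ZooKeeper zk, String path) {
                  this.zk = zk;
                  this.path = path;
              }

              void start() throws Exception {
                  reload();
              }

              public void process(WatchedEvent event) {
                  if (event.getType() == Event.EventType.NodeDataChanged) {
                      try {
                          reload(); // re-reads the plan and re-arms the watch
                      } catch (Exception e) {
                          // log; watches are one-shot, so a failed re-read also loses the watch
                      }
                  }
              }

              private void reload() throws Exception {
                  Stat stat = new Stat();
                  byte[] plan = zk.getData(path, this, stat); // passing "this" re-registers the watch
                  applyPlan(plan, stat.getVersion());
              }

              private void applyPlan(byte[] plan, int version) {
                  // placeholder: parse the plan and reconfigure the local node accordingly
              }
          }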

          Mark Miller added a comment - - edited

          These system properties are actually already bugging me. They are essentially for the case of a single core - in general, many of them won't work in the multicore case, because you will want to be able to specify different values for different cores - that's fine, this can be done in solr.xml. But it's annoying to have the two different mechanisms.

          I've been thinking about the idea of perhaps "deprecating" the old single core mode and requiring a solr.xml? One of the major changes would be the URLs and directory structure - but it seems like we could allow a core with a name of "", that has the same dir structure as single core now? Then the only real change would be having solr.xml for many. We could leave the old support for no solr.xml, but require it for new features like ZooKeeper integration.

          That allows us to add whatever properties we need in a simpler manner. For example, it might be nice if local config files would override ZooKeeper config files - but I'd really like to make that optional. But I wouldn't want to add another sys property for it, and then also support it in solr.xml per core.

          Mark Miller added a comment -

          What all are we trying to achieve in this issue?

          I think Jason did a good job of capturing that in the summary.

          Noble Paul added a comment -

          hi Everyone.

          What all are we trying to achieve in this issue? Should we handle the resource loading in this issue?

          Can we limit the scope of this issue and add more issues for other things?

          Jason Rutherglen added a comment -

          The question then becomes what do you want to make automatic vs those things that require operator intervention.

          Right, I'd like the distributed Solr + ZK system to
          automatically failover to another server if there's a functional
          software failure. Also, with a search system query times are
          very important and if they suddenly drop off on a replicated
          server, the node needs to be removed and a new server brought
          online (hopefully automatically). If Solr + ZK doesn't take out
          a server whose query times are 10 times the average of the other
          comparable replicated slave servers, then it's harder to
          justify going live with it, in my humble opinion because it's
          not really solving the main reason to use a naming service.

          While this may not be functionality we need in an initial
          release, it's important to ensure our initial design does not
          limit future functionality.

          Patrick Hunt added a comment -

          Obviously monitoring the system as a whole is important; there are some great tools for that already (ganglia/nagios - I have seen comments from hbase users that indicate they are doing something like this on a case-by-case basis).

          The question then becomes what do you want to make automatic vs those things that require operator intervention.

          Jason Rutherglen added a comment -

          If we're detecting node failure, it seems Solr's functional
          health should also be checked for failure. The discussions thus
          far seem to be around network or process failure, which is
          usually either intermittent or terminal. Detecting measurable
          increases/decreases in CPU, RAM consumption, OOMs, query
          failures, and indexing failures due to bugs is probably more important than the
          network being down, because they are harder to detect and fix.

          How is HBase handling the detection of functional issues in
          relation to ZK?

          Patrick Hunt added a comment -

          Patrick, how low is it feasible to set the timeout? Could it be set low enough that it could be the only input to a failover decision in the case of a very high query load? That is, say a cluster with 3 query slaves is handling 600 queries per second, which means each is getting 200qps, or one every 5ms on average. If a slave were to fail, queries will start backing up pretty quickly unless a decision is made to drop the failed node within 500ms or so. Clearly, whatever node is distributing the queries to the slaves can mark the failed node as down (say, in the case of a HW load balancer), but could we rely on ZK to handle this for us?

          See https://issues.apache.org/jira/browse/ZOOKEEPER-601 for background

          Typically you will have a server ticktime of 2 seconds, so min that the server allows currently is 4 seconds. This means that the client will send a ping every 4/3 seconds, waiting up to 4/3 seconds for a response before it considers the server down. The server of course will expire the session after 4 seconds in this case.

          It should work (say 601 is fixed) but I would not encourage you to go down this road, instead you can do something better (although I don't know enough about solr, perhaps this is worse, it may also depend on whether/what hw load balancer you have)

          Rather I would suggest that you do something similar to the lease - periodically publish some load information from the query slaves to zk. Every 250ms your query slave could push an update that says "I am doing Xqps currently". If you don't see an update in 500ms, maybe you consider the slave dead till it comes back (updates the znode again). If you don't have a hwLB you might even be able to take advantage of this information when passing queries to slaves. Worst case scenario, you could expose this information through a dashboard, giving good insight into solr workings to an operator.

          Each slave is doing 4 updates to zk per second in this case. You are more reliant on having a stable framework for ZK, keep that in mind (the cluster must be performant, low gc pauses in zk itself (ie tune the gc properly) etc...)

          See my zk service latency review for what you should expect re latencies in some situations: http://bit.ly/4ekN8G
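
          As a rough sketch of the publish-side idea above, using the plain ZooKeeper Java client (the QpsPublisher name, the znode path, and the 250ms interval are illustrative assumptions, not anything from a patch):

          import java.util.concurrent.Executors;
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.TimeUnit;
          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.ZooDefs;
          import org.apache.zookeeper.ZooKeeper;

          // Hypothetical example: a query slave periodically publishes its current QPS to an
          // ephemeral znode so that readers can treat a stale znode as "slave is gone".
          public class QpsPublisher {
            private final ZooKeeper zk;
            private final String path; // e.g. /solr_slaves/slave-host1 (parent node must already exist)

            public QpsPublisher(ZooKeeper zk, String path) throws Exception {
              this.zk = zk;
              this.path = path;
              // Ephemeral: if this process dies, the znode disappears when the session expires.
              zk.create(path, "0".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            }

            public void start(final QpsSource qps) {
              ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
              exec.scheduleAtFixedRate(new Runnable() {
                public void run() {
                  try {
                    // -1 = any version; readers compare the znode's mtime against a 500ms staleness threshold.
                    zk.setData(path, Long.toString(qps.currentQps()).getBytes(), -1);
                  } catch (KeeperException e) {
                    // log and retry on the next tick; a missed update just looks like a stale slave
                  } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                  }
                }
              }, 0, 250, TimeUnit.MILLISECONDS);
            }

            public interface QpsSource { long currentQps(); }
          }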

          Grant Ingersoll added a comment -

          Is there a benefit to refactoring the shards piece to a component rather than a simple helper class or something?

          Yes. For starters, not all requests require the QueryComponent, but still may require distributed (TermsComponent) caps. Second, I think it is cleaner and allows others to plugin/override with their own capabilities.

          Brian Pinkerton added a comment -

          Yes, we'll definitely want to have different timeouts. For instance, I can see having the indexer have a relatively long timeout, while query slaves would have very short timeouts.

          Patrick, how low is it feasible to set the timeout? Could it be set low enough that it could be the only input to a failover decision in the case of a very high query load? That is, say a cluster with 3 query slaves is handling 600 queries per second, which means each is getting 200qps, or one every 5ms on average. If a slave were to fail, queries will start backing up pretty quickly unless a decision is made to drop the failed node within 500ms or so. Clearly, whatever node is distributing the queries to the slaves can mark the failed node as down (say, in the case of a HW load balancer), but could we rely on ZK to handle this for us?

          Patrick Hunt added a comment -

          Any pointers on ways to deal with this?

          From our experience with hbase (which is the only place we've seen this issue so far, at least to this extent) you need to think about:

          1) client timeout value tradeoffs
          2) effects of session expiration due to gc pause, potential ways to mitigate

          for 1) there is a tradeoff (the good thing is that not all clients need to use the same timeout, so you can tune based on the client type, you can even have multiple sessions for a single client, each with its own timeout). You can set the timeout higher, so if your zk client pauses you don't get expired, however this also means that if your client crashes the session won't be expired until the timeout expires. This means that the rest of your system will not be notified of the change (say you are doing leader election) for longer than you might like.

          for 2) you need to think about the potential failure cases and their effects. a) Say your ZK client (solr component X) fails (the host crashes), do you need to know about this in 5 seconds, or 30sec? b) Say the host is network partitioned due to a burp in the network that lasts 5 seconds, is this ok, or does the rest of the solr system need to know about this? c) Say component X gc pauses for 4 minutes, do you want the rest of the system to react immediately, or consider this "ok" and just wait around for a while for X to come back.... but keep in mind that from the perspective of "the rest of your system" you don't know the difference between a), b) or c) (etc...), from their viewpoint X is gone and they don't know why (unless it eventually comes back)

          In hbase case session expiration is expensive as the region server master will reallocate the table (or some such). In your case the effects of X going down may not be very expensive. If this is the case then having a low(er) session timeout for X may not be a problem. (just deal with the session timeout when it does happen, X will eventually come back)

          If X recovery is expensive you may want to set the timeout very high. but as I said this makes the system less responsive if X has a real problem. Another option we explored with hbase is to use a "lease" recipe instead. Set a very high timeout, but have X update the znode (still ephemeral) every N seconds. If the rest of the system (whoever is interested in X status) doesn't see an update from X in T seconds, then perhaps you log a warning ("where is X?"). Say you don't see an update from X in T*2 seconds, then page the operator "warning, maybe problems with X". Say you don't see in T*3 seconds (perhaps this is the timeout you use, in which case the znode is removed), consider X down, cleanup and enact recovery. These are madeup actions/times, but you can see what I'm getting at. With lease it's not "all or nothing". You (solr) have the option to take actions based on the lease time, rather than just the znode being deleted in the typical case (all or nothing). The tradeoff here is that it's a bit more complicated for you - you need to implement the lease rather than just relying on the znode being deleted - you would of course set a watch on the znode to get notified when the znode is removed (etc...)
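
          A rough sketch of the observer side of that lease recipe (the LeaseMonitor name, path, and concrete actions are made up for illustration; the thresholds follow the T / 2T / 3T escalation described above):

          import org.apache.zookeeper.ZooKeeper;
          import org.apache.zookeeper.data.Stat;

          // Hypothetical lease check: X refreshes its znode every N seconds; we escalate based on
          // how long ago the znode was last modified. Assumes loosely synchronized clocks - a real
          // implementation might track the last locally observed mtime change instead.
          public class LeaseMonitor {
            private final ZooKeeper zk;
            private final String path;      // e.g. /solr_components/X
            private final long leaseMillis; // T

            public LeaseMonitor(ZooKeeper zk, String path, long leaseMillis) {
              this.zk = zk;
              this.path = path;
              this.leaseMillis = leaseMillis;
            }

            public void check() throws Exception {
              Stat stat = zk.exists(path, false);
              if (stat == null) {
                recover();                            // znode gone: session expired, treat X as down
                return;
              }
              long age = System.currentTimeMillis() - stat.getMtime();
              if (age > 3 * leaseMillis) {
                recover();                            // consider X down, clean up and enact recovery
              } else if (age > 2 * leaseMillis) {
                pageOperator();                       // "warning, maybe problems with X"
              } else if (age > leaseMillis) {
                logWarning();                         // "where is X?"
              }
            }

            private void logWarning() { /* log */ }
            private void pageOperator() { /* alert the operator */ }
            private void recover() { /* cleanup and failover */ }
          }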

          Mark Miller added a comment -

          and the shards refactoring

          Is there a benefit to refactoring the shards piece to a component rather than a simple helper class or something?

          Mark Miller made changes -
          Attachment SOLR-1277.patch [ 12426670 ]
          Mark Miller added a comment - - edited

          I've added a ZKSolrResourceLoader to allow loading other config from ZooKeeper as well (stopwords, protwords, etc).

          It doesn't yet handle the issue where some code uses getConfigDir to load a file themselves, but handles all of the cases needed to get the current ZooKeeper tests to pass. So for the current ZooKeeper tests, I think all or almost all config is now being loaded from ZooKeeper.
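
          Conceptually, loading a config resource from ZooKeeper rather than the local conf dir boils down to a getData call on a conf znode. A minimal sketch (this is a hypothetical helper, not the patch's ZKSolrResourceLoader; the /solr/conf layout is just the one sketched elsewhere in this issue):

          import java.io.ByteArrayInputStream;
          import java.io.InputStream;
          import org.apache.zookeeper.ZooKeeper;

          public class ZkConfigReader {
            // Read a config file (stopwords.txt, protwords.txt, ...) from a znode instead of disk.
            static InputStream openResource(ZooKeeper zk, String configRoot, String name) throws Exception {
              byte[] bytes = zk.getData(configRoot + "/" + name, false, null); // e.g. /solr/conf/stopwords.txt
              return new ByteArrayInputStream(bytes);
            }
          }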

          Yonik Seeley added a comment -

          in particular the issue they have faced is that the JVM GC can pause the ZK client vm (hbase region server) for >> 1 minute (in some cases we saw 4 minutes). If this is an issue for you (does solr ever see pauses like this?) we may need to discuss.

          I've seen multi-minute pauses myself in the past (on an indexing master during an optimize). Lucene and the JVMs have improved since then, but we probably can't rule it out. Any pointers on ways to deal with this?

          Mark Miller added a comment -

          This has been an issue for the hbase team - in particular the issue they have faced is that the JVM GC can pause the ZK client vm (hbase region server) for >> 1 minute (in some cases we saw 4 minutes). If this is an issue for you (does solr ever see pauses like this?) we may need to discuss.

          This can be an issue on large indexes - users generally don't like to see GC pause times of more than a few seconds (if that), and you can usually tweak things down (using CMS, tuning config), but I'm sure there are users out there that are still running into longer GC pauses (especially those with huge heaps that have not tuned their solr config or GC settings).

          Grant Ingersoll added a comment -

          Mark,

          I think this makes sense. I think you can grab my ZK admin ReqHandler and the shards refactoring, too and pull that into this patch, as most of them are independent of the actual startup/config part. If you don't get to it, I will try to next week.

          Patrick Hunt added a comment -

          Hi, Patrick from ZooKeeper here. Great start on this (jira and wiki)! Wanted to point out a few things (a bit of a brain dump but...). Please keep in mind that I know very little of the inner working of solr itself, so pardon any dumb questions :

          As I mentioned previously please take advantage of the work we've been doing/documenting with the hbase team:
          http://wiki.apache.org/hadoop/ZooKeeper/HBaseAndZooKeeper
          In particular they have been learning a hard lesson wrt how ZooKeeper sessions work. When you establish a session you specify the "timeout" parameter (I see you have default listed as 10 sec which is great). Be aware that this controls two things: 1) the client timeout, 2) the session timeout. The client is heartbeating the server to keep the session alive. Every 1/3 of the timeout the client sends a heartbeat. If the client does not hear back in an additional 1/3 of the timeout it will consider the server unavailable and attempt to connect to another server in the cluster (based on the server list you provided during session creation). If the server does not hear from the client within the timeout period it will consider the client unavailable and clean up the session. This includes deleting any ephemeral nodes owned by the expired session. This has been an issue for the hbase team - in particular the issue they have faced is that the JVM GC can pause the ZK client vm (hbase region server) for >> 1 minute (in some cases we saw 4 minutes). If this is an issue for you (does solr ever see pauses like this?) we may need to discuss.

          I think you are correct in having 2 distinct configurations; 1) ZK cluster (ensemble or standalone) configuration, 2) your client configuration

          I see this in your example:
          <str name="zkhostPorts">localhost:2181</str>
          which is great for "solr quickstart with zk" - basically this would be a "standalone" zk installation, vs something like
          <str name="zkhostPorts">host1:2181,host2:2181,host3:2181</str>
          which a user might run for a production system (tolerating a single point of failure, ie 1 ZK server can go down and the cluster will still be available)

          You should think now about how users will interact with the system once ZK is introduced. In particular troubleshooting. This is an issue that has been vexing hbase as well - how to educate and support users. How to provide enough information, but not too much (ie "go learn zk") to troubleshoot basic problems such as mis-configuration.

          Will "ZooKeeperAwareShardHandler" set watches? Some of the text on the wiki page implies watches to monitor state in zk, it would be good to call this out explicitly.

          I saw mention of "ZKClient", does this just mean the "official" ZooKeeper client/class we ship with the release, your own wrapper, or something else?

          I also saw this comment "(this is dependent on ZooKeeper supporting getFirstChild() which it currently does not)".
          We have no plans to add this in the near future afaik (there is a jira for something similar but I'm not aware of anyone working on it recently) – however typically this can be done through the use of the sequential flag.

          1) Create your znodes with the sequential flag
          2) to "getFirstChild()" just call "getChildren()" and sort on the sequence number

          will this work for you? (this is the simplest I could think of, there are other options if this doesn't work that we could discuss)
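
          A small sketch of that sequential-node approach, assuming plain ZooKeeper client calls (the parent path and node prefix are illustrative):

          import java.util.Collections;
          import java.util.List;
          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.ZooDefs;
          import org.apache.zookeeper.ZooKeeper;

          public class FirstChild {
            // 1) create children with the sequential flag so ZooKeeper appends a monotonically
            //    increasing, zero-padded suffix, e.g. /solr_masters/node-0000000003
            static String register(ZooKeeper zk, String parent, byte[] data) throws Exception {
              return zk.create(parent + "/node-", data,
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            }

            // 2) "getFirstChild()": list the children and sort; with a common prefix the lexically
            //    smallest name is the one with the lowest sequence number.
            static String firstChild(ZooKeeper zk, String parent) throws Exception {
              List<String> children = zk.getChildren(parent, false);
              if (children.isEmpty()) return null;
              Collections.sort(children);
              return parent + "/" + children.get(0);
            }
          }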

          What does this refer to? "The trick here is how to keep all the masters in a group in sync" Something that ZK itself could help to mediate?

          Noble Paul added a comment -

          Should we let the users choose the way they wish to create the zookeeper clients

          • put it in solr.xml if it is shared across all cores
          • put it in solrconfig.xml if it is to be created on a per core basis (or if it is a single-core deployment).

          The components can always access the ZookeeperComponent using SolrCore#getZookeeperComponent().

          All the responsibilities of shard/master-slave etc are delegated to appropriate components (please see the wiki).

          Noble Paul added a comment -


          Mark, we are building the wiki http://wiki.apache.org/solr/ZooKeeperIntegration so that everyone is on the same page regarding the design. Because this is a huge feature, I wish to reach a consensus on how each aspect of this is done. If you wish to change something in the design, go and edit the page also.

          Mark Miller added a comment -

          zookeeper component can be kept by CoreContainer.

          It is in this patch.

          Zookeeper has a standard conf file. Why don't we use the same thing instead of inventing new system properties.

          Huh? These properties are not what you would set with the ZooKeeper conf property. If we get enough properties, I can see moving them out to a conf file and specifying that, but if we stick to a few, my preference would be to avoid the conf file until we determine it makes sense to use one. But these properties are not duplicating anything in the zookeeper conf file, so I'm not sure I get your point.

          The ZooKeeper conf is for configuring the ZooKeeper quorum - that is and should be separate from what's going on in this patch, which deals with the ZooKeeper client. You start the quorum separately (using a ZooKeeper conf), and then Solr will connect to it. When starting Solr, you will want to be able to give it the address of the quorum you want to connect to.

          Noble Paul added a comment -

          zookeeper component can be kept by CoreContainer. Zookeeper has a standard conf file. Why don't we use the same thing instead of inventing new system properties.

          Mark Miller made changes -
          Attachment zookeeper-3.2.0.jar [ 12413566 ]
          Mark Miller made changes -
          Attachment zookeeper-3.2.1.jar [ 12426558 ]
          Mark Miller made changes -
          Attachment SOLR-1277.patch [ 12426557 ]
          Mark Miller added a comment - - edited

          I'd like to reboot this issue a bit. I've started a fresh experimental pass.

          I think we want to move ZooKeeper's home out of SolrCore and into CoreContainer - one of the big benefits of using ZooKeeper is the ability to config from a central location. This implies that a ZooKeeper client should be up and running before any SolrCore is.

          So this new patch explores that path - it's just an initial stab in that direction, so a lot is left undone, but I figured I'd put it up to get feedback on the new approach.

          This initial pass allows basic loading of config/schema from ZooKeeper and adds some ground work for ZooKeeper unit tests.

          Some simple notes below:

          ZooKeeper Integration
          
          Features:
          
          load solrconfig/schema from zookeeper (currently no other config files - want to be able to load any config from ZK)
          
          TODO: handle distributed search node choices, fault tolerance
          TODO: integrate with replication
          TODO: fault tolerant indexing
          TODO: more ;)
          
          Startup?
          
          System Properties? -DzkHost=localhost:2127 {-DzkPath=/solr} - if no zkHost found, run non ZooKeeper mode
          
          Optional Properties? -Dsolrconfig=solrconfig.xml -Dschema=schema.xml - multicore config: comes from solr.xml
          
          Groups? -Dgroup= ? -Dshard= ? How about multicore? Set in solr.xml?
          
          Distinguish master from slave?
           Another sys property?
          
          Cache zk info locally and 'watch' for changes
          
          
          Layout - configurable prefix - defaults to solr
          
          /solr/conf
                   /solrconfig.xml
                   /schema.xml
          
          multicore?
          
          /solr/core0/conf
          /solr/core1/conf
          
          
          ZooKeeper Limitations:
          
          1 MB of data per node
          wants zk log on its own device (not just partition)
           (perhaps not so necessary with fewer nodes?)
          must not swap to maintain performance
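
          A tiny sketch of the startup decision proposed in the notes above (the property names come from the notes; everything else, including treating zkPath as a chroot, is an illustrative assumption):

          import org.apache.zookeeper.WatchedEvent;
          import org.apache.zookeeper.Watcher;
          import org.apache.zookeeper.ZooKeeper;

          // Started e.g. with -DzkHost=localhost:2181 -DzkPath=/solr; if zkHost is absent,
          // fall back to plain non-ZooKeeper mode and load config from disk as today.
          public class ZkStartup {
            public static ZooKeeper connectIfConfigured() throws Exception {
              String zkHost = System.getProperty("zkHost");
              if (zkHost == null) {
                return null; // non-ZooKeeper mode
              }
              String zkPath = System.getProperty("zkPath", "/solr"); // used here as a chroot
              return new ZooKeeper(zkHost + zkPath, 10000, new Watcher() {
                public void process(WatchedEvent event) { /* connection state events */ }
              });
            }
          }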
          
          Noble Paul made changes -
          Link This issue is blocked by SOLR-1431 [ SOLR-1431 ]
          Grant Ingersoll made changes -
          Link This issue depends on SOLR-1585 [ SOLR-1585 ]
          Noble Paul added a comment -

          I shall do that. I shall edit the page
          http://wiki.apache.org/solr/ZooKeeperIntegration and let you know once
          it is done.



          Patrick Hunt added a comment -

          Unfortunately I'm not very familiar with Solr, I might be able to better comment if we put together something like what we are doing with HBase:

          http://wiki.apache.org/hadoop/ZooKeeper/HBaseAndZooKeeper
          http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases

          Feel free to create a

          http://wiki.apache.org/hadoop/ZooKeeper/SolrAndZooKeeper
          or more to the point:
          http://wiki.apache.org/hadoop/ZooKeeper/SolrUseCases

          where we could focus the discussion at a high level in terms of your intended use, current and future.

          Noble Paul added a comment -

          hi Patrick, thanks

          we can let the user specify any path string there as they wish. This should let them use a dedicated zookeeper cluster, or they can share the zookeeper.

          You may review the approach I have outlined and let us know if there are any flaws in it. If it is not clear I'll be glad to provide more details

          Patrick Hunt added a comment -

          Hi phunt here from ZK team, let me say that we are psyched to see you considering use of ZK in solr! Feel
          free to ping us http://hadoop.apache.org/zookeeper/mailing_lists.html if you have any questions.

          One suggestion, you may want a solr specific root:

          /solr/${solr.domain}/....

          (prefix with /solr) so that if end users want to use ZK for things in addition to solr they can set up
          a single zk cluster and still have information segregated nicely.

          You can always use zk client's "chroot" feature when connecting, this would map /solr as / from the perspective
          of the clients. You might consider chrooting /solr/${solr.domain} if the client for ${solr.domain}
          only looks at data from its domain.
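
          For example, the chroot is just a path suffix on the connect string. A minimal sketch (hosts and domain are placeholders):

          import org.apache.zookeeper.WatchedEvent;
          import org.apache.zookeeper.Watcher;
          import org.apache.zookeeper.ZooKeeper;

          public class ChrootExample {
            public static void main(String[] args) throws Exception {
              // Every path this client uses is resolved relative to /solr/mydomain on the ensemble.
              ZooKeeper zk = new ZooKeeper(
                  "host1:2181,host2:2181,host3:2181/solr/mydomain", 10000,
                  new Watcher() {
                    public void process(WatchedEvent event) { /* connection state events */ }
                  });
              // "/shard1/master" really lives at /solr/mydomain/shard1/master
              System.out.println(zk.exists("/shard1/master", false));
              zk.close();
            }
          }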

          Noble Paul added a comment - - edited

          I propose a hierarchical node structure in ZooKeeper.

          eg:

          /<zookeeper_rootdir>
                 ${solr.domain}
                     /shard1
                            /master -> url=myurl:8080,rep_url=/replication,a=b,c=d
                             /slave
                                    /slave00 -> url=myurl:8080,a=b,c=d
                                    /slave01 -> url=myurl:8080,a=b,c=d
          

          There is only one master, so we can have a special node.
          Slaves are many, so we can use a sequential node name. Zookeeper will take care of assigning a name. We can add data such as the url etc as the node data.

          We must be able to use the same zookeeper component for distributed search, for discovering other shards. SOLR-1431 can have a zookeeper based implementation too.
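
          As a rough illustration of that layout (node names and data format are copied from the example above; the ShardRegistry helper itself is hypothetical, and the persistent parent nodes are assumed to exist already):

          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.ZooDefs;
          import org.apache.zookeeper.ZooKeeper;

          public class ShardRegistry {
            // One master per shard: a fixed ephemeral node; a second create() fails with NodeExists.
            static void registerMaster(ZooKeeper zk, String shardPath, String data) throws Exception {
              zk.create(shardPath + "/master", data.getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            }

            // Many slaves: ephemeral sequential nodes, so ZooKeeper assigns slave0000000000, slave0000000001, ...
            static String registerSlave(ZooKeeper zk, String shardPath, String data) throws Exception {
              return zk.create(shardPath + "/slave/slave", data.getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            }
          }

          // e.g. registerSlave(zk, "/solr/mydomain/shard1", "url=myurl:8080,a=b,c=d")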

          Noble Paul added a comment -

          No... a zookeeper installation will be more static, but creating a new core should be very dynamic. One should even be able to run multiple solr clusters using the same zookeeper cluster.

          I guess I confused everyone. By zookeeper instance I meant an instance of ZooKeeperRequestHandler.

          Grant Ingersoll added a comment -

          The ZK client is actually very lightweight. I believe it needs to be per core, as you could have multiple layers and levels and each core could belong to different levels. Also, at least the way it currently works it relies on the full URL to know where it is.

          Most of the configuration is not for ZK itself, but instead for a core to properly register with ZK.

          Yonik Seeley added a comment -

          Seems like we should have the ability to get pretty much all the configuration from zookeeper?

          Noble Paul added a comment -

          That is the point. If we specify everything in the solrconfig.xml, that automatically makes it belong to a core. How do we take care of that problem?

          Yonik Seeley added a comment -

          Another important design aspect is that , do we want to have separate zookeeper instances for each core in a multicore environment?

          No... a zookeeper installation will be more static, but creating a new core should be very dynamic. One should even be able to run multiple solr clusters using the same zookeeper cluster.

          Noble Paul added a comment -

          Grant, a couple of points.
          Let the zookeeper configuration be put into a zookeeper plugin itself. Let us use the standard configuration syntax. Custom syntax for each of the plugins is leading to too much xml parsing.

          <requestHandler name="/zoo" class="solr.ZooKeeperRequestHandler">
              <!-- See the ZooKeeper docs -->
              <str name="hostPorts">localhost:2181</str>
              <!-- TODO: figure out how to do this programmatically -->
              <str name="me">localhost:8983/solr</str>
              <!-- Timeout for the ZooKeeper.  Optional.  Default 10000 -->
              <!-- Timeout in ms -->
              <str name="timeout">5000</str>
              <str name="shardsNodeName">/solr_shards</str>
              <str name="mastersNodeName">/solr_masters</str>  
              <bool name="shard">true</bool>
              <str name="master">master_group_1</str>
            </requestHandler>
          

          users can drive the values of each of these from an external properties file anyway (using solrcore.properties SOLR-1335)

          Another important design aspect: do we want to have separate zookeeper instances for each core in a multicore environment? Or is it possible to have a zookeeper component at the CoreContainer level?

          Noble Paul added a comment -

          Do you mean to fix this disadvantage of solr by ZooKeeper in this project?

          We would definitely want to. Every issue is a work in progress. As users, you can participate and influence the design and implementation at all phases.

          Keven Xun added a comment -

          Nice work. I read your article 'What's new with Apache Solr'. As you said, "the master node isn't fault-tolerant, so if it goes down, the system can't index new documents or perform replication". Do you mean to fix this disadvantage of solr by ZooKeeper in this project? That sounds like a great idea!

          Noble Paul added a comment -

          It's configured in Solr Config, so whichever core does the configuration.

          In the sample the url does not specify the core; that is why I asked. Like the replication url, it can have a complete path [192.168.0.1:8080/solr/corename/zoo].

          Grant Ingersoll added a comment -

          The same information is repeated in the name

          Ideally, the name would contain all the info and you wouldn't have to use the data bit here at all, but the Solr URL also looks like a ZK path, so I escaped the string for the name and also stored it on the data. Of course, the name could be anything, but I thought it should at least be something that is going to be human readable.

          If there are multiple cores which zkrequesthandler will be used?

          It's configured in Solr Config, so whichever core does the configuration. I suppose this requires you to pick a core that has the Req Handler, but the ZooKeeper client is lightweight, so it isn't a big deal if there are multiple Req Handlers having one. So, just configure it on every core if you want. I believe the current impl is that the ZooKeeper is attached to the SolrCore, so I think it should just work, but I haven't tested it.

          Noble Paul added a comment -

          Grant, the shards entry is a little confusing

          The same information is repeated in the name

          192.168.0.1_8080_solr [192.168.0.1:8080/solr]  
          

          basically the ip and the port are repeated.

          If there are multiple cores which zkrequesthandler will be used?

          Grant Ingersoll added a comment -

          Solr has, up until 1.3, focused on query load and usability. Then it added distributed support, which right now handles 95%+ of most large index cases without a problem. With the addition of ZooKeeper and a few other pieces it should handle the last 5% of large cases as well, thus combining usability with scale. It can already do index size, it just needs a bit better management and automation. And yes, master failover, rack awareness, etc. are needed.

          Stefan Groschupf added a comment -

          Hi Grant,
          Just for the archive... I think Solr and Katta originally come from two different use cases.
          I understand Solr is an enterprise search server, a kind of document server with a high level http based API (no lucene api interaction), now adding support for distributed environments. Katta was designed from the ground up as a distributed sharding service with a focus on scalability for index size and query load. We try to expose as much lucene API as possible.
          Katta for example has rack and data center awareness, master failover etc - does Solr have this too?

          Grant Ingersoll added a comment -

          There really is no need for Katta. Solr has all of those mechanisms in place; it just needs ZooKeeper for coordination.

          Jason Rutherglen added a comment -

          Thanks for posting this, Grant, to get it started. I got tied up in other things, including how/if we integrate with Katta, which also uses ZooKeeper.

          Linbin Chen added a comment -

          good idea and good project

          Jason Rutherglen made changes -
          Description Replaced "SOLR" with "Solr" throughout; the wording is otherwise identical to the issue description above.
          Grant Ingersoll made changes -
          Attachment SOLR-1277.patch [ 12413565 ]
          Attachment zookeeper-3.2.0.jar [ 12413566 ]
          Attachment log4j-1.2.15.jar [ 12413567 ]
          Grant Ingersoll added a comment -

          First draft. Addresses the newSearcher event in a minimal way.

          Add the two jar files to the lib directory.

          Then, set up two shards based on the solrconfig.xml in the example, changing the properties appropriately.

          I will lay out more on the wiki.

          Jason Rutherglen made changes -
          Summary Implement a SOLR specific naming service (using Zookeeper) Implement a Solr specific naming service (using Zookeeper)
          Grant Ingersoll added a comment -

          I'm bringing my patch up to date, will post tonight or tomorrow.

          Grant Ingersoll added a comment -

          Wiki entry started at http://wiki.apache.org/solr/ZooKeeperIntegration

          Grant Ingersoll made changes -
          Link This issue is blocked by SOLR-1237 [ SOLR-1237 ]
          Grant Ingersoll added a comment -

          I have a patch that starts on some of this, which I did on the plane the other day, but I haven't had time to write it up. It has auto shard registration, failover for distributed search, and a plan for rebalancing for replication, etc.
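
          As a companion to the registration sketch earlier, here is a rough, untested sketch of the client side of that failover: a searcher lists the children of the assumed /solr_shards node to build its shard list and leaves a watch, so that when a shard's ephemeral node disappears the list is rebuilt and the dead server is no longer queried. Again, the path, timeout, and class name are assumptions for illustration, not the patch's actual layout.

          import java.util.ArrayList;
          import java.util.List;
          import org.apache.zookeeper.WatchedEvent;
          import org.apache.zookeeper.Watcher;
          import org.apache.zookeeper.ZooKeeper;

          public class ShardDiscovery implements Watcher {
            private final ZooKeeper zk;
            private volatile List<String> liveShards = new ArrayList<String>();

            public ShardDiscovery(String zkHosts) throws Exception {
              zk = new ZooKeeper(zkHosts, 10000, this);
              refresh();
            }

            // Rebuild the shard list and re-register the watch in one call.
            private void refresh() throws Exception {
              List<String> children = zk.getChildren("/solr_shards", this);
              List<String> urls = new ArrayList<String>();
              for (String child : children) {
                // The node data holds the un-escaped Solr URL (see the registration sketch).
                byte[] data = zk.getData("/solr_shards/" + child, false, null);
                urls.add(new String(data, "UTF-8"));
              }
              liveShards = urls;
            }

            public List<String> getLiveShards() {
              return liveShards;
            }

            public void process(WatchedEvent event) {
              // Any change under /solr_shards (a shard joining or dying) triggers a refresh,
              // so distributed searches stop being routed to servers that have gone away.
              try {
                refresh();
              } catch (Exception e) {
                // a real implementation would log and retry here
              }
            }
          }
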

          Grant Ingersoll made changes -
          Field Original Value New Value
          Assignee Grant Ingersoll [ gsingers ]
          Jason Rutherglen created issue -

            People

            • Assignee: Grant Ingersoll
            • Reporter: Jason Rutherglen
            • Votes: 1
            • Watchers: 16

            Time Tracking

            • Original Estimate: 672h
            • Remaining Estimate: 672h
            • Time Spent: Not Specified