Details

    • Type: Sub-task Sub-task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 5.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      In a very large cluster the Overseer is likely to be overloaded.If the same node is a serving a few other shards it can lead to OverSeer getting slowed down due to GC pauses , or simply too much of work . If the cluster is really large , it is possible to dedicate high end h/w for OverSeers

      It works as a new collection admin command

      command=addrole&role=overseer&node=192.168.1.5:8983_solr

      This results in the creation of a entry in the /roles.json in ZK which would look like the following

      {
      "overseer" : ["192.168.1.5:8983_solr"]
      }
      

      If a node is designated for overseer it gets preference over others when overseer election takes place. If no designated servers are available another random node would become the Overseer.

      Later on, if one of the designated nodes are brought up ,it would take over the Overseer role from the current Overseer to become the Overseer of the system

      1. SOLR-5476.patch
        17 kB
        Noble Paul
      2. SOLR-5476.patch
        23 kB
        Noble Paul
      3. SOLR-5476.patch
        24 kB
        Noble Paul
      4. SOLR-5476.patch
        25 kB
        Noble Paul
      5. SOLR-5476.patch
        25 kB
        Noble Paul
      6. SOLR-5476.patch
        5 kB
        Noble Paul
      7. SOLR-5476.patch
        5 kB
        Noble Paul

        Issue Links

          Activity

          Hide
          Noble Paul added a comment -

          The strategy to implement this is as follows

          Prioritization logic run at Overseer node

          • Any Overseer elect would first check for the presence of /roles.json in ZK. If not return
          • lookup the children of /overseer_elect node and see if there are nodes with Overseer roles among children. If not return
          • All nodes whose seq number is smaller than that of the first Overseer role node is asked to disconnect and reconnect . So that those nodes will append themselves to the tail of the list and the Overseer role node would come up at the head
          • After this process , the Overseer checks if it is itself an Overseer role node , If not ,relinquish it's Overseer position and go back to election

          If a new node comes up who is assigned an Overseer role, it has to ensure that it is at the head of the election queue ,

          • If any new node comes up, it checks for the /roles.json . If present , checks if it is assigned an overseer role, if yes , it checks if the head of the election queue is a node with Overseer role. If not, send a message to the current Overseer to run the prioritization logic
          Show
          Noble Paul added a comment - The strategy to implement this is as follows Prioritization logic run at Overseer node Any Overseer elect would first check for the presence of /roles.json in ZK. If not return lookup the children of /overseer_elect node and see if there are nodes with Overseer roles among children. If not return All nodes whose seq number is smaller than that of the first Overseer role node is asked to disconnect and reconnect . So that those nodes will append themselves to the tail of the list and the Overseer role node would come up at the head After this process , the Overseer checks if it is itself an Overseer role node , If not ,relinquish it's Overseer position and go back to election If a new node comes up who is assigned an Overseer role, it has to ensure that it is at the head of the election queue , If any new node comes up, it checks for the /roles.json . If present , checks if it is assigned an overseer role, if yes , it checks if the head of the election queue is a node with Overseer role. If not, send a message to the current Overseer to run the prioritization logic
          Hide
          Noble Paul added a comment -

          New command implemented addrole and removerole . The only supported role now is overseer

          Show
          Noble Paul added a comment - New command implemented addrole and removerole . The only supported role now is overseer
          Hide
          Noble Paul added a comment - - edited

          added a testcase

          Show
          Noble Paul added a comment - - edited added a testcase
          Hide
          Noble Paul added a comment -

          some errors fixed

          Show
          Noble Paul added a comment - some errors fixed
          Hide
          Noble Paul added a comment -

          patch with more logging

          Show
          Noble Paul added a comment - patch with more logging
          Hide
          Noble Paul added a comment -

          code made more robust

          Show
          Noble Paul added a comment - code made more robust
          Hide
          ASF subversion and git services added a comment -

          Commit 1558760 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1558760 ]

          SOLR-5476 Overseer Role for nodes

          Show
          ASF subversion and git services added a comment - Commit 1558760 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1558760 ] SOLR-5476 Overseer Role for nodes
          Hide
          ASF subversion and git services added a comment -

          Commit 1558776 from Noble Paul in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558776 ]

          SOLR-5476 Overseer Role for nodes

          Show
          ASF subversion and git services added a comment - Commit 1558776 from Noble Paul in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1558776 ] SOLR-5476 Overseer Role for nodes
          Hide
          ASF subversion and git services added a comment -

          Commit 1558829 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1558829 ]

          SOLR-5476 , logging added

          Show
          ASF subversion and git services added a comment - Commit 1558829 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1558829 ] SOLR-5476 , logging added
          Hide
          ASF subversion and git services added a comment -

          Commit 1558830 from Noble Paul in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558830 ]

          SOLR-5476 , logging added

          Show
          ASF subversion and git services added a comment - Commit 1558830 from Noble Paul in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1558830 ] SOLR-5476 , logging added
          Hide
          ASF subversion and git services added a comment -

          Commit 1558846 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1558846 ]

          SOLR-5476 removing forbidden API usage

          Show
          ASF subversion and git services added a comment - Commit 1558846 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1558846 ] SOLR-5476 removing forbidden API usage
          Hide
          ASF subversion and git services added a comment -

          Commit 1558847 from Noble Paul in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558847 ]

          SOLR-5476 removing forbidden API usage

          Show
          ASF subversion and git services added a comment - Commit 1558847 from Noble Paul in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1558847 ] SOLR-5476 removing forbidden API usage
          Hide
          ASF subversion and git services added a comment -

          Commit 1559043 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1559043 ]

          SOLR-5476: Fix forbidden.

          PLEASE RUN "ant precommit" (root) or alternatively the faster "ant check-forbidden-apis" (in your module folder) before committing!

          Show
          ASF subversion and git services added a comment - Commit 1559043 from Uwe Schindler in branch 'dev/trunk' [ https://svn.apache.org/r1559043 ] SOLR-5476 : Fix forbidden. PLEASE RUN "ant precommit" (root) or alternatively the faster "ant check-forbidden-apis" (in your module folder) before committing!
          Hide
          ASF subversion and git services added a comment -

          Commit 1559044 from Uwe Schindler in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1559044 ]

          Merged revision(s) 1559043 from lucene/dev/trunk:
          SOLR-5476: Fix forbidden.

          PLEASE RUN "ant precommit" (root) or alternatively the faster "ant check-forbidden-apis" (in your module folder) before committing!

          Show
          ASF subversion and git services added a comment - Commit 1559044 from Uwe Schindler in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1559044 ] Merged revision(s) 1559043 from lucene/dev/trunk: SOLR-5476 : Fix forbidden. PLEASE RUN "ant precommit" (root) or alternatively the faster "ant check-forbidden-apis" (in your module folder) before committing!
          Hide
          ASF subversion and git services added a comment -

          Commit 1559100 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1559100 ]

          SOLR-5476 logging added

          Show
          ASF subversion and git services added a comment - Commit 1559100 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1559100 ] SOLR-5476 logging added
          Hide
          Mark Miller added a comment -

          Have you actually run into issues with this? Seems like premature optimization...the overseer simply fires http commands and simple zk stuff...you really think 'special hardware' overseers are going to matter and not just complicate the code?

          Show
          Mark Miller added a comment - Have you actually run into issues with this? Seems like premature optimization...the overseer simply fires http commands and simple zk stuff...you really think 'special hardware' overseers are going to matter and not just complicate the code?
          Hide
          Noble Paul added a comment -

          It is not really about special H/W , but mostly about 'dedicated' H/W.

          If the same Overseer node is used for normal cores there is a good chance that those nodes do some CPU intensive operations or GC pauses which would delay Overseer operations. This can cause piling up of messages in large clusters

          Show
          Noble Paul added a comment - It is not really about special H/W , but mostly about 'dedicated' H/W. If the same Overseer node is used for normal cores there is a good chance that those nodes do some CPU intensive operations or GC pauses which would delay Overseer operations. This can cause piling up of messages in large clusters
          Hide
          ASF subversion and git services added a comment -

          Commit 1559399 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1559399 ]

          SOLR-5476 ftests are being fixed . SO ignore for the time being

          Show
          ASF subversion and git services added a comment - Commit 1559399 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1559399 ] SOLR-5476 ftests are being fixed . SO ignore for the time being
          Hide
          ASF subversion and git services added a comment -

          Commit 1559402 from Noble Paul in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1559402 ]

          SOLR-5476 ftests are being fixed . SO ignore for the time being

          Show
          ASF subversion and git services added a comment - Commit 1559402 from Noble Paul in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1559402 ] SOLR-5476 ftests are being fixed . SO ignore for the time being
          Hide
          ASF subversion and git services added a comment -

          Commit 1559405 from Noble Paul in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1559405 ]

          SOLR-5476 ftests are being fixed . SO ignore for the time being

          Show
          ASF subversion and git services added a comment - Commit 1559405 from Noble Paul in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1559405 ] SOLR-5476 ftests are being fixed . SO ignore for the time being
          Hide
          ASF subversion and git services added a comment -

          Commit 1559677 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1559677 ]

          SOLR-5476 tests passing

          Show
          ASF subversion and git services added a comment - Commit 1559677 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1559677 ] SOLR-5476 tests passing
          Hide
          ASF subversion and git services added a comment -

          Commit 1559680 from Noble Paul in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1559680 ]

          SOLR-5476 tests passing

          Show
          ASF subversion and git services added a comment - Commit 1559680 from Noble Paul in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1559680 ] SOLR-5476 tests passing
          Hide
          ASF subversion and git services added a comment -

          Commit 1559936 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1559936 ]

          SOLR-5476 handle nonode exception

          Show
          ASF subversion and git services added a comment - Commit 1559936 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1559936 ] SOLR-5476 handle nonode exception
          Hide
          ASF subversion and git services added a comment -

          Commit 1559937 from Noble Paul in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1559937 ]

          SOLR-5476 handle nonode exception

          Show
          ASF subversion and git services added a comment - Commit 1559937 from Noble Paul in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1559937 ] SOLR-5476 handle nonode exception
          Hide
          Mark Miller added a comment -

          This test is still failing on almost all of my local test runs.

          Show
          Mark Miller added a comment - This test is still failing on almost all of my local test runs.
          Hide
          ASF subversion and git services added a comment -

          Commit 1560279 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1560279 ]

          SOLR-5476 hardened a bit

          Show
          ASF subversion and git services added a comment - Commit 1560279 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1560279 ] SOLR-5476 hardened a bit
          Hide
          ASF subversion and git services added a comment -

          Commit 1560280 from Noble Paul in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1560280 ]

          SOLR-5476 hardened a bit

          Show
          ASF subversion and git services added a comment - Commit 1560280 from Noble Paul in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1560280 ] SOLR-5476 hardened a bit
          Hide
          Noble Paul added a comment -

          Mark Miller What is the sure shot way for Overseer/OverseerCollectionProcessor to quit and somebody else take its place? The failures are due to the leader not giving up

          Show
          Noble Paul added a comment - Mark Miller What is the sure shot way for Overseer/OverseerCollectionProcessor to quit and somebody else take its place? The failures are due to the leader not giving up
          Hide
          Shalin Shekhar Mangar added a comment -

          Perhaps you can create your own instance of overseer and force election just like OverseerTest does with OverseerRestarter?

          Show
          Shalin Shekhar Mangar added a comment - Perhaps you can create your own instance of overseer and force election just like OverseerTest does with OverseerRestarter?
          Hide
          ASF subversion and git services added a comment -

          Commit 1560314 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1560314 ]

          SOLR-5476 ignored till there is a reliable way to restart Overseer

          Show
          ASF subversion and git services added a comment - Commit 1560314 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1560314 ] SOLR-5476 ignored till there is a reliable way to restart Overseer
          Hide
          ASF subversion and git services added a comment -

          Commit 1560315 from Noble Paul in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1560315 ]

          SOLR-5476 needs a reliable way to restart Overseer

          Show
          ASF subversion and git services added a comment - Commit 1560315 from Noble Paul in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1560315 ] SOLR-5476 needs a reliable way to restart Overseer
          Hide
          ASF subversion and git services added a comment -

          Commit 1562434 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1562434 ]

          SOLR-5476 tests were failing earlier

          Show
          ASF subversion and git services added a comment - Commit 1562434 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1562434 ] SOLR-5476 tests were failing earlier
          Hide
          ASF subversion and git services added a comment -

          Commit 1562435 from Noble Paul in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1562435 ]

          SOLR-5476 tests were failing earlier

          Show
          ASF subversion and git services added a comment - Commit 1562435 from Noble Paul in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1562435 ] SOLR-5476 tests were failing earlier
          Hide
          Mark Miller added a comment - - edited

          See common jenkins test fail - something is off here, with the impl or the test. An Overseer reading the queue can run into another Overseer having already removed the distrib queue node being looked at.

          Show
          Mark Miller added a comment - - edited See common jenkins test fail - something is off here, with the impl or the test. An Overseer reading the queue can run into another Overseer having already removed the distrib queue node being looked at.
          Hide
          Mark Miller added a comment -

          I don't think this is right. Where do you ensure the old Overseer is stopped first? You can't just force a new one by deleting the election node and doing the run leader process.

          Show
          Mark Miller added a comment - I don't think this is right. Where do you ensure the old Overseer is stopped first? You can't just force a new one by deleting the election node and doing the run leader process.
          Hide
          Mark Miller added a comment -

          Trying to force a leader like that is complicated, and I think there are a bunch of holes here.

          Show
          Mark Miller added a comment - Trying to force a leader like that is complicated, and I think there are a bunch of holes here.
          Hide
          Mark Miller added a comment -

          I think the tests are too light as well. You want strong testing around this special overseer takeover - there are so many things that could go wrong - you need to make sure things continue merrily after several disaster scenario. And it needs to ensure that there are never more than one Overseer running at a time.

          I think this was committed before it was ready.

          Show
          Mark Miller added a comment - I think the tests are too light as well. You want strong testing around this special overseer takeover - there are so many things that could go wrong - you need to make sure things continue merrily after several disaster scenario. And it needs to ensure that there are never more than one Overseer running at a time. I think this was committed before it was ready.
          Hide
          Mark Miller added a comment -

          I've worked a little to try and do some of the right things along the current approach, but I'm starting to feel I'm -1 on this approach to leader prioritization. I think if there was proper testing here, no way doing things this way would survive it, and I'm not sure I can make this idea work.

          A better first approach probably involves having prospective leaders give up there position in line if they see a better candidate behind them. The trick then being to look out for how to deal with a node that looks like a better candidate but for some reason keeps failing to be leader.

          Show
          Mark Miller added a comment - I've worked a little to try and do some of the right things along the current approach, but I'm starting to feel I'm -1 on this approach to leader prioritization. I think if there was proper testing here, no way doing things this way would survive it, and I'm not sure I can make this idea work. A better first approach probably involves having prospective leaders give up there position in line if they see a better candidate behind them. The trick then being to look out for how to deal with a node that looks like a better candidate but for some reason keeps failing to be leader.
          Hide
          Noble Paul added a comment - - edited

          A better first approach probably involves having prospective leaders give up there position in line if they see a better candidate behind them.

          That can also be done too. But it is generally better to keep the Overseer designates line up first and second (if there are more than one) in the queue to avoid delay and it is easy to do also. Usually the command is invoked to assign an overseer after the Overseer is already elected and running. In that scenario we need to force the current Overseer to give up their position .

          Show
          Noble Paul added a comment - - edited A better first approach probably involves having prospective leaders give up there position in line if they see a better candidate behind them. That can also be done too. But it is generally better to keep the Overseer designates line up first and second (if there are more than one) in the queue to avoid delay and it is easy to do also. Usually the command is invoked to assign an overseer after the Overseer is already elected and running. In that scenario we need to force the current Overseer to give up their position .
          Hide
          Mark Miller added a comment -

          The problem is the reliance on the isLeader checks in the over seer threads to kill the previous overseer. It's a race, and the tests are showing how easy it is for multiple overseers to interfere with each other. We have to avoid those races, both because the current impls require it, but also because who knows what future things the overseer may do. The Overseer must be a mutually exclusive process 100%.

          I'm not sure that we can bullet proof this approach. Perhaps. But I wouldn't believe it until we had some much stronger testing. The current test fails almost every run for me when I run it in eclipse because Overseers interfere with each other.

          I'm also a little scared of the way this messes with the election process, but that appears like it should be okay.

          Show
          Mark Miller added a comment - The problem is the reliance on the isLeader checks in the over seer threads to kill the previous overseer. It's a race, and the tests are showing how easy it is for multiple overseers to interfere with each other. We have to avoid those races, both because the current impls require it, but also because who knows what future things the overseer may do. The Overseer must be a mutually exclusive process 100%. I'm not sure that we can bullet proof this approach. Perhaps. But I wouldn't believe it until we had some much stronger testing. The current test fails almost every run for me when I run it in eclipse because Overseers interfere with each other. I'm also a little scared of the way this messes with the election process, but that appears like it should be okay.
          Hide
          Noble Paul added a comment -

          The problem of race condition happens on the first time ADDROLE is invoked . Once you have added a couple of Overseers designate it won't happen again because the Overseer designates are always in the front for election and then there is no need to kill a running overseer

          Show
          Noble Paul added a comment - The problem of race condition happens on the first time ADDROLE is invoked . Once you have added a couple of Overseers designate it won't happen again because the Overseer designates are always in the front for election and then there is no need to kill a running overseer
          Hide
          Mark Miller added a comment -

          It's a huge problem. As you can see from the extremely common test fails. Currently, you see a fail were it kills the processing thread. That race is a big deal. I'm -1 on this for 4.7 given the problems with it. Something has to be done that doesn't compromise the Overseer running mutually exclusively. This is important now and for the future.

          We need to ensure the Overseer cannot possibly step on itself, and this feature pretty much ensures that it will.

          I'm -1 on the feature until the implementation is fixed.

          Show
          Mark Miller added a comment - It's a huge problem. As you can see from the extremely common test fails. Currently, you see a fail were it kills the processing thread. That race is a big deal. I'm -1 on this for 4.7 given the problems with it. Something has to be done that doesn't compromise the Overseer running mutually exclusively. This is important now and for the future. We need to ensure the Overseer cannot possibly step on itself, and this feature pretty much ensures that it will. I'm -1 on the feature until the implementation is fixed.
          Hide
          Noble Paul added a comment -

          What is the best strategy to kill an existing overseer? Once we agree upon that it is easy to fix it

          Show
          Noble Paul added a comment - What is the best strategy to kill an existing overseer? Once we agree upon that it is easy to fix it
          Hide
          Noble Paul added a comment -

          The test does not fail . It errors out . Because I forcefully kill the overseer.

          Show
          Noble Paul added a comment - The test does not fail . It errors out . Because I forcefully kill the overseer.
          Hide
          Noble Paul added a comment -

          This is to avoid the race condition

          In this patch

          • The new overseer is asked to wait for the STATE_UPDATE_DELAY before processing the queue items
          • STATE_UPDATE_DELAY is the maximum interval between the Overseer checks for amILeader().
          Show
          Noble Paul added a comment - This is to avoid the race condition In this patch The new overseer is asked to wait for the STATE_UPDATE_DELAY before processing the queue items STATE_UPDATE_DELAY is the maximum interval between the Overseer checks for amILeader().
          Hide
          Mark Miller added a comment -

          I don't know that that is a great solution either though.

          For one, future overseer tasks may have nothing to do with the state delay.

          For two, you don't know how long any overseer job will take - you might wait that delay, but the overseer may be taking 30 seconds or more to run a command and then when it done it removes a queue item out from the new overseer, killing the thread. I can't work out how this protects against Overseers completely from overlapping.

          I've thought through a lot of this type of thing, and it just seems like there are lots of holes that are very hard to fill on this path.

          Show
          Mark Miller added a comment - I don't know that that is a great solution either though. For one, future overseer tasks may have nothing to do with the state delay. For two, you don't know how long any overseer job will take - you might wait that delay, but the overseer may be taking 30 seconds or more to run a command and then when it done it removes a queue item out from the new overseer, killing the thread. I can't work out how this protects against Overseers completely from overlapping. I've thought through a lot of this type of thing, and it just seems like there are lots of holes that are very hard to fill on this path.
          Hide
          Noble Paul added a comment -

          The in-flight operations for Overseer are really short running. Keeping the delay slightly longer than STATE_UPDATE_DELAY would not hurt.

          Again , the problem of Overlap because of long running operations is no just restricted to the Overseer Roles feature it can happen otherwise too.

          Show
          Noble Paul added a comment - The in-flight operations for Overseer are really short running. Keeping the delay slightly longer than STATE_UPDATE_DELAY would not hurt. Again , the problem of Overlap because of long running operations is no just restricted to the Overseer Roles feature it can happen otherwise too.
          Hide
          Mark Miller added a comment -

          Again , the problem of Overlap because of long running operations is no just restricted to the Overseer Roles feature it can happen otherwise too.

          It's not likely to happen otherwise. Those checks that you are now using like a feature are really just fail safes. Emergency shutoffs to minimize damage in a bad situation. They are not meant to be hit. Normally an Overseer will be an Overseer until it dies, crashes, cases to exist. There will be no overlap in any normal circumstances.

          This patch changes that.

          Show
          Mark Miller added a comment - Again , the problem of Overlap because of long running operations is no just restricted to the Overseer Roles feature it can happen otherwise too. It's not likely to happen otherwise. Those checks that you are now using like a feature are really just fail safes. Emergency shutoffs to minimize damage in a bad situation. They are not meant to be hit. Normally an Overseer will be an Overseer until it dies, crashes, cases to exist. There will be no overlap in any normal circumstances. This patch changes that.
          Hide
          ASF subversion and git services added a comment -

          Commit 1566247 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1566247 ]

          SOLR-5476 avoiding race condition

          Show
          ASF subversion and git services added a comment - Commit 1566247 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1566247 ] SOLR-5476 avoiding race condition
          Hide
          ASF subversion and git services added a comment -

          Commit 1566248 from Noble Paul in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1566248 ]

          SOLR-5476 avoiding race condition

          Show
          ASF subversion and git services added a comment - Commit 1566248 from Noble Paul in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1566248 ] SOLR-5476 avoiding race condition
          Hide
          Mark Miller added a comment -

          I just saw this fail again locally with the latest commit.

          Show
          Mark Miller added a comment - I just saw this fail again locally with the latest commit.
          Hide
          Noble Paul added a comment -

          I ran it around 20 times on my local box and could not get an error/failure . I'll look at jenkins for failures

          Show
          Noble Paul added a comment - I ran it around 20 times on my local box and could not get an error/failure . I'll look at jenkins for failures
          Hide
          Mark Miller added a comment -

          I've seen it fail a couple times already - something about failing to reorder.

          Show
          Mark Miller added a comment - I've seen it fail a couple times already - something about failing to reorder.
          Hide
          Mark Miller added a comment -

          To sum up an offline conversation Noble and I had about this:

          I'm not happy with the approach, I think it's dangerous, and I'm worried about making more and more of these types of exceptions.

          However, I will withdraw my -1 on it if the testing is significantly beefed up and it can be proved the current approach is usually not going to screw you. I think the current test is an absolute minimum test and it has regularly failed on jenkins (and for me locally even more fequently) since it was committed.

          We also agreed on a JIRA issue to come up with a better approach at some point.

          I also think we need to add some defensive programming. If we try and remove a queue item that is not there, we should assume another overseer already ate it rather than letting the Overseer thread die.

          Show
          Mark Miller added a comment - To sum up an offline conversation Noble and I had about this: I'm not happy with the approach, I think it's dangerous, and I'm worried about making more and more of these types of exceptions. However, I will withdraw my -1 on it if the testing is significantly beefed up and it can be proved the current approach is usually not going to screw you. I think the current test is an absolute minimum test and it has regularly failed on jenkins (and for me locally even more fequently) since it was committed. We also agreed on a JIRA issue to come up with a better approach at some point. I also think we need to add some defensive programming. If we try and remove a queue item that is not there, we should assume another overseer already ate it rather than letting the Overseer thread die.
          Hide
          Noble Paul added a comment -

          Another testcase to kill an existing overseer and checking for another one to takeover

          Show
          Noble Paul added a comment - Another testcase to kill an existing overseer and checking for another one to takeover
          Hide
          ASF subversion and git services added a comment -

          Commit 1566620 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1566620 ]

          SOLR-5476 added testcase

          Show
          ASF subversion and git services added a comment - Commit 1566620 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1566620 ] SOLR-5476 added testcase
          Hide
          ASF subversion and git services added a comment -

          Commit 1566624 from Noble Paul in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1566624 ]

          SOLR-5476 added testcase

          Show
          ASF subversion and git services added a comment - Commit 1566624 from Noble Paul in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1566624 ] SOLR-5476 added testcase
          Hide
          ASF subversion and git services added a comment -

          Commit 1567012 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1567012 ]

          SOLR-5476 added testcase

          Show
          ASF subversion and git services added a comment - Commit 1567012 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1567012 ] SOLR-5476 added testcase
          Hide
          Ramkumar Aiyengar added a comment -

          Will this feature lend itself to dynamically updating the role of a node – so depending on the state of the environment, I can grant or revoke the Overseer role? I guess the collection admin will technically allow for it, but just wanted to check if that's something that can be routinely done when the cluster is live without any performance or stability impact.

          Show
          Ramkumar Aiyengar added a comment - Will this feature lend itself to dynamically updating the role of a node – so depending on the state of the environment, I can grant or revoke the Overseer role? I guess the collection admin will technically allow for it, but just wanted to check if that's something that can be routinely done when the cluster is live without any performance or stability impact.
          Hide
          Noble Paul added a comment -

          Yes, you can invoke the collection API on a live system. The performance impact will be there for overseer operations for a few seconds . Regular read /write should not be impacted at all

          Show
          Noble Paul added a comment - Yes, you can invoke the collection API on a live system. The performance impact will be there for overseer operations for a few seconds . Regular read /write should not be impacted at all
          Hide
          Mark Miller added a comment -

          The testing for this is still super minimal. Also the following was not addressed:

          I also think we need to add some defensive programming. If we try and remove a queue item that is not there, we should assume another overseer already ate it rather than letting the Overseer thread die.
          

          I don't think much of what I was concerned about was addressed. The test can also still fail because two Overseers are running at the same time (Mike just saw this).

          If this didn't go out in 4.7 I would bring back the -1.

          This is why I said on the phone that I would prefer we mark it as experimental - so that it could still be vetoed after release if the problems were not addressed.

          Show
          Mark Miller added a comment - The testing for this is still super minimal. Also the following was not addressed: I also think we need to add some defensive programming. If we try and remove a queue item that is not there, we should assume another overseer already ate it rather than letting the Overseer thread die. I don't think much of what I was concerned about was addressed. The test can also still fail because two Overseers are running at the same time (Mike just saw this). If this didn't go out in 4.7 I would bring back the -1. This is why I said on the phone that I would prefer we mark it as experimental - so that it could still be vetoed after release if the problems were not addressed.
          Hide
          Noble Paul added a comment -

          I also think we need to add some defensive programming. If we try and remove a queue item that is not there, we should assume another overseer already ate it rather than letting the Overseer thread die.

          We are trying to kill an overseer safely.

          What is being done is wait for STATE_UPDATE_DELAY+100 ms and then start the next Overseer . I agree that this is not 100% foolproof (but I have not seen OverseerRolesTest fail for quite sometime)

          We can definitely add more checks to ensure a proper handing over of overseer role. I didn't quite get the above suggestion clearly. Please elaborate . I'll be glad to implement that

          I think this should be an issue in itself rather than confusing with the "Roles" issue.

          This is why I said on the phone that I would prefer we mark it as experimental

          Where do I need to sign?

          BTW , I would say the users of this feature are very unlikely to hit this issue

          Show
          Noble Paul added a comment - I also think we need to add some defensive programming. If we try and remove a queue item that is not there, we should assume another overseer already ate it rather than letting the Overseer thread die. We are trying to kill an overseer safely. What is being done is wait for STATE_UPDATE_DELAY+100 ms and then start the next Overseer . I agree that this is not 100% foolproof (but I have not seen OverseerRolesTest fail for quite sometime) We can definitely add more checks to ensure a proper handing over of overseer role. I didn't quite get the above suggestion clearly. Please elaborate . I'll be glad to implement that I think this should be an issue in itself rather than confusing with the "Roles" issue. This is why I said on the phone that I would prefer we mark it as experimental Where do I need to sign? BTW , I would say the users of this feature are very unlikely to hit this issue
          Hide
          Noble Paul added a comment -

          I'll open another issue for Overseer restart and I will work on hardening that. This probably is not the right place

          Show
          Noble Paul added a comment - I'll open another issue for Overseer restart and I will work on hardening that. This probably is not the right place

            People

            • Assignee:
              Noble Paul
              Reporter:
              Noble Paul
            • Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development