Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.6, 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels: None

Description

      There are several JIRAs (I'll link in a second) about trying to be more efficient about processing overseer messages as the overseer can become a bottleneck, especially with very large numbers of replicas in a cluster. One of the approaches mentioned near the end of SOLR-5872 (15-Mar) was to "read large no:of items say 10000. put them into in memory buckets and feed them into overseer....".

      This JIRA is to break out that part of the discussion as it might be an easy win whereas "eliminating the Overseer queue" would be quite an undertaking.
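
       A rough, hedged sketch of that bucketing idea in plain Java follows. It uses a hypothetical Message type standing in for the real queue entries (which carry serialized ZkNodeProps): drain up to N items from the queue, group them in memory by collection, and hand each bucket to the state writer in one pass so that writes for the same collection can be coalesced.

         import java.util.ArrayDeque;
         import java.util.ArrayList;
         import java.util.LinkedHashMap;
         import java.util.List;
         import java.util.Map;
         import java.util.Queue;

         public class OverseerBucketingSketch {

           // Hypothetical stand-in for a queue entry; real Overseer messages carry serialized ZkNodeProps.
           static final class Message {
             final String collection;
             final byte[] payload;
             Message(String collection, byte[] payload) {
               this.collection = collection;
               this.payload = payload;
             }
           }

           /** Drain up to maxItems messages and bucket them by collection, preserving arrival order. */
           static Map<String, List<Message>> bucketByCollection(Queue<Message> queue, int maxItems) {
             Map<String, List<Message>> buckets = new LinkedHashMap<>();
             for (int i = 0; i < maxItems; i++) {
               Message m = queue.poll();
               if (m == null) break; // queue drained
               buckets.computeIfAbsent(m.collection, k -> new ArrayList<>()).add(m);
             }
             return buckets;
           }

           public static void main(String[] args) {
             Queue<Message> queue = new ArrayDeque<>();
             queue.add(new Message("collA", new byte[0]));
             queue.add(new Message("collB", new byte[0]));
             queue.add(new Message("collA", new byte[0]));
             // Each bucket can now be applied to cluster state and written to ZK once,
             // rather than producing one write per message.
             bucketByCollection(queue, 10_000).forEach((coll, msgs) ->
                 System.out.println(coll + " -> " + msgs.size() + " messages"));
           }
         }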

Attachments

      1. SOLR-10524.patch
        2 kB
        Cao Manh Dat
      2. SOLR-10524.patch
        16 kB
        Cao Manh Dat
      3. SOLR-10524.patch
        13 kB
        Noble Paul
      4. SOLR-10524.patch
        12 kB
        Cao Manh Dat
      5. SOLR-10524-dragonsinth.patch
        4 kB
        Scott Blum
      6. SOLR-10524-NPE-fix.patch
        0.6 kB
        Christine Poerschke


Activity

          caomanhdat Cao Manh Dat added a comment -

          Patch for this ticket. Thanks Noble Paul for the raw patch.

          noble.paul Noble Paul added a comment -

          Cao Manh Dat Looks good.

           This should be able to dramatically reduce the state update operations when the overseer is under heavy load.

           I have made the method Overseer#sortItems() public static. Can you please write simple JUnits to test the order of items produced by that method?

           Scott Blum, is it possible for you to take a look at the patch, especially the changes made to DistributedQueue?
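
           As an illustration of the kind of ordering test being asked for, here is a minimal JUnit sketch. The actual Overseer#sortItems() signature isn't shown in this thread, so a hypothetical stand-in that groups string messages by a collection prefix is used in its place; swap in the real method when writing the test.

             import static org.junit.Assert.assertEquals;

             import java.util.ArrayList;
             import java.util.Arrays;
             import java.util.LinkedHashMap;
             import java.util.List;
             import java.util.Map;
             import org.junit.Test;

             public class SortItemsOrderTest {

               // Hypothetical stand-in for Overseer.sortItems(): group messages by their
               // "collection/" prefix, keeping the relative order of messages within each collection.
               private static List<String> sortItemsStandIn(List<String> items) {
                 Map<String, List<String>> byCollection = new LinkedHashMap<>();
                 for (String item : items) {
                   String collection = item.substring(0, item.indexOf('/'));
                   byCollection.computeIfAbsent(collection, k -> new ArrayList<>()).add(item);
                 }
                 List<String> sorted = new ArrayList<>();
                 byCollection.values().forEach(sorted::addAll);
                 return sorted;
               }

               @Test
               public void groupsByCollectionAndKeepsRelativeOrder() {
                 List<String> in = Arrays.asList("collA/1", "collB/1", "collA/2", "collB/2", "collA/3");
                 List<String> expected = Arrays.asList("collA/1", "collA/2", "collA/3", "collB/1", "collB/2");
                 // Replace the stand-in with a call to Overseer.sortItems(...) in the real test.
                 assertEquals(expected, sortItemsStandIn(in));
               }
             }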

           dragonsinth Scott Blum added a comment - edited

          Couple of thoughts:

          1) In the places where you've changed Collection -> List, I would go one step further and make it a concrete ArrayList, to a) explicitly convey that the returned list is a mutable copy rather than a view of internal state and b) explicitly convey that sortAndAdd() is operating efficiently on said lists.

          2) DQ.remove(id): don't you want to unconditionally knownChildren.remove(id), even if the ZK delete succeeds?

          3) DQ.remove(id): there is no need to loop here, in fact you'll get stuck in an infinite loop if someone else deletes the node you're targeting. The reason there's a loop in removeFirst() is because it's trying a different id each iteration.

          Suggested remove(id) impl:

            public void remove(String id) throws KeeperException, InterruptedException {
              // Remove the ZK node *first*; ZK will resolve any races with peek()/poll().
              // This is counterintuitive, but peek()/poll() will not return an element if the underlying
              // ZK node has been deleted, so it's okay to update knownChildren afterwards.
              try {
                String path = dir + "/" + id;
                zookeeper.delete(path, -1, true);
              } catch (KeeperException.NoNodeException e) {
                // Another client deleted the node first, this is fine.
              }
              updateLock.lockInterruptibly();
              try {
                knownChildren.remove(id);
              } finally {
                updateLock.unlock();
              }
            }
          
          erickerickson Erick Erickson added a comment -

          Kind of a side note, but I got distracted when looking at this by IntelliJ claiming that these were unreferenced. Are they necessary?

          private final DistributedMap runningMap;
          private final DistributedMap completedMap;
          private final DistributedMap failureMap;
          private final Stats zkStats;

          caomanhdat Cao Manh Dat added a comment -

          Updated patch for ticket.

          dragonsinth Scott Blum added a comment -

           Updated patch: DistributedQueue LGTM.

          shalinmangar Shalin Shekhar Mangar added a comment -
           1. The tmp list in the sortItems method should also be a LinkedList, otherwise tmp.remove(0) becomes expensive.
           2. I ran the OverseerTest#testPerformance method, which simulates a worst-case scenario of 20000 mixed collection updates, and it shows that update_state invocations drop by two orders of magnitude, from 20011 to 131.
           3. However, the overall time does not change that much: it drops from 3m 3s 531ms without the patch to 2m 53s 282ms. Presumably, when real-world latencies between the overseer and ZK are accounted for, the difference should be larger. I'd like us to benchmark this with a remote ZK host to see how much this patch increases overseer throughput.
           4. This patch processes messages in an order different from the state update queue but always removes the first element. This is wrong and can cause a lot of problems in the cluster if the overseer fails over and restarts processing. We must remove the message that was processed.
           5. Also, now that the order of processing is different, we must have tests that assert that the right items are removed from the queue at all times, even during overseer restarts. The bar of testing for this kind of change has to be very high!
           6. Is all the re-sorting logic even necessary? It seems that the intention is to work around the batching logic inside ZkStateWriter. Why not remove the batching logic (when switching between collections) from ZkStateWriter altogether? It will simplify both places.
          caomanhdat Cao Manh Dat added a comment -

           Updated patch for this ticket after a discussion with Noble Paul and Shalin Shekhar Mangar. Here are the results of OverseerTest.testPerformance():

          Without the patch

          Overseer loop finished processing:
          avgRequestsPerSecond: 0.00809284358238982
          5minRateRequestsPerSecond: 0.0
          15minRateRequestsPerSecond: 0.0
          avgTimePerRequest: 123564881129000000
          medianRequestTime: 123564881129000000
          75thPcRequestTime: 123564881129000000
          95thPcRequestTime: 123564881129000000
          99thPcRequestTime: 123564881129000000
          999thPcRequestTime: 123564881129000000
          op: am_i_leader, success: 3, failure: 0
          avgRequestsPerSecond: 0.024318192042511424
          5minRateRequestsPerSecond: 0.2726342664775392
          15minRateRequestsPerSecond: 0.35201956953766844
          avgTimePerRequest: 353111000000
          medianRequestTime: 116973000000
          75thPcRequestTime: 116973000000
          95thPcRequestTime: 1733875000000
          99thPcRequestTime: 1733875000000
          999thPcRequestTime: 1733875000000
          op: update_state, success: 20011, failure: 0
          avgRequestsPerSecond: 162.28792277377633
          5minRateRequestsPerSecond: 106.44733871784089
          15minRateRequestsPerSecond: 89.86620980167666
          avgTimePerRequest: 213680000000
          medianRequestTime: 205539000000
          75thPcRequestTime: 221076000000
          95thPcRequestTime: 253206000000
          99thPcRequestTime: 282888000000
          999thPcRequestTime: 548583000000
          op: state, success: 20001, failure: 0
          avgRequestsPerSecond: 162.44457624784178
          5minRateRequestsPerSecond: 107.66013079551965
          15minRateRequestsPerSecond: 91.18766381210062
          avgTimePerRequest: 13250000000
          medianRequestTime: 11459000000
          75thPcRequestTime: 16188000000
          95thPcRequestTime: 21414000000
          99thPcRequestTime: 39280000000
          999thPcRequestTime: 67098000000

          With the patch

          Overseer loop finished processing:
          avgRequestsPerSecond: 0.00802836931576006
          5minRateRequestsPerSecond: 0.0
          15minRateRequestsPerSecond: 0.0
          avgTimePerRequest: 124556932520000000
          medianRequestTime: 124556932520000000
          75thPcRequestTime: 124556932520000000
          95thPcRequestTime: 124556932520000000
          99thPcRequestTime: 124556932520000000
          999thPcRequestTime: 124556932520000000
          op: am_i_leader, success: 3, failure: 0
          avgRequestsPerSecond: 0.024113954682119472
          5minRateRequestsPerSecond: 0.2726342664775392
          15minRateRequestsPerSecond: 0.35201956953766844
          avgTimePerRequest: 306734000000
          medianRequestTime: 116296000000
          75thPcRequestTime: 116296000000
          95thPcRequestTime: 1417483000000
          99thPcRequestTime: 1417483000000
          999thPcRequestTime: 1417483000000
          op: update_state, success: 52, failure: 0
          avgRequestsPerSecond: 0.4181288003958347
          5minRateRequestsPerSecond: 0.4
          15minRateRequestsPerSecond: 0.4
          avgTimePerRequest: 2117982000000
          medianRequestTime: 2054633000000
          75thPcRequestTime: 2212862000000
          95thPcRequestTime: 2648609000000
          99thPcRequestTime: 4582074000000
          999thPcRequestTime: 6145919000000
          op: state, success: 20001, failure: 0
          avgRequestsPerSecond: 161.02141495173862
          5minRateRequestsPerSecond: 107.06882627730678
          15minRateRequestsPerSecond: 91.09679521134835
          avgTimePerRequest: 17483000000
          medianRequestTime: 16009000000
          75thPcRequestTime: 22093000000
          95thPcRequestTime: 32283000000
          99thPcRequestTime: 46404000000
          999thPcRequestTime: 117668000000

           As we can see, the number of update_state operations is greatly reduced, from 20011 to 52.

           shalinmangar Shalin Shekhar Mangar added a comment - edited

          Yes, I like this. Same performance, much smaller changes and no chance of something going wrong in the cluster because of processing re-ordered messages. +1 to commit.

           There are optimizations we can do on the read side using multi-get. Let's open another issue to explore that as well. Oops, ZooKeeper has no multi-get.

           As a side note, there is a bug in the nsToMs method in testOverseer: it actually treats the nanoseconds as milliseconds and then converts them to nanoseconds! I'll fix it separately.

          erickerickson Erick Erickson added a comment -

          So if I were preparing an "executive summary", there would be several take-aways:

           1> The number of update_state operations, i.e. the number of times state is actually written to ZK, is drastically lower under heavy load; by a factor of almost 400!

           2> One implication here is that the number of state change notifications that ZK has to send out, and thus the number of times the state gets read by Solr nodes, is also decreased by that same factor. So the fact that the state-read throughput is the same should be evaluated in light of the fact that there will be many fewer such reads.

           3> One thing not captured by the numbers is that the size of the Overseer queue is much less likely to spin out of control, due to both <2> and the fact that we're reading/ordering/processing batches of up to 10,000 messages at once.

           4> Even though some of the throughput numbers haven't changed (am_i_leader for instance), those operations will spend much less time waiting to be carried out due to 1-3. Plus, three points may make a circle, but that isn't enough data to make a good generalization from.

          Is this fair? Accurate? Complete? I'm looking for something to present to those users who have seen the Overseer queue grow to the 100s of K, effectively making their cluster unusable.

          Thanks for this work! As collections get larger and larger this has become a very significant pain-point.

          caomanhdat Cao Manh Dat added a comment -

           I will commit the patch soon if no one has any objection.

          jira-bot ASF subversion and git services added a comment -

          Commit 20c4886816ceae96af9d99a5e99f5cd9037d8ef4 in lucene-solr's branch refs/heads/master from Cao Manh Dat
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=20c4886 ]

          SOLR-10524: Explore in-memory partitioning for processing Overseer queue messages

          shalinmangar Shalin Shekhar Mangar added a comment -

          Thanks for the summary Erick Erickson.

           3> One thing not captured by the numbers is that the size of the Overseer queue is much less likely to spin out of control, due to both <2> and the fact that we're reading/ordering/processing batches of up to 10,000 messages at once.

           We aren't reading them 10k at a time because there is no multi-read in ZK. However, we write the results of the processing once per 10K messages, or every 2.5 seconds, or whenever the queue goes empty, whichever happens first.

          The rest looks okay. SOLR-10619 fixes an even bigger problem with the distributed queue used by overseer so we'll see even bigger gains after it is resolved.
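
           For reference, a minimal sketch of that flush policy, with illustrative names and thresholds rather than the actual Solr constants: flush when the batch reaches 10K updates, when roughly 2.5 seconds have elapsed, or when the queue is drained, whichever comes first.

             // Illustrative only; the real thresholds and wiring live in Overseer/ZkStateWriter.
             final class FlushPolicy {
               private static final int MAX_BATCHED_UPDATES = 10_000;
               private static final long MAX_BATCH_AGE_NANOS = 2_500L * 1_000_000L; // ~2.5 seconds

               private int batchedUpdates = 0;
               private long batchStartNanos = System.nanoTime();

               void onUpdate() {
                 batchedUpdates++;
               }

               boolean shouldFlush(boolean queueEmpty) {
                 return queueEmpty
                     || batchedUpdates >= MAX_BATCHED_UPDATES
                     || System.nanoTime() - batchStartNanos >= MAX_BATCH_AGE_NANOS;
               }

               void onFlush() {
                 batchedUpdates = 0;
                 batchStartNanos = System.nanoTime();
               }
             }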

          shalinmangar Shalin Shekhar Mangar added a comment -

          Cao Manh Dat – please rename this issue and edit the description in CHANGES.txt – there is no in-memory partitioning being done here. We are simply improving batching in overseer for messages coming from mixed collections to reduce ZK collection state writes.

          noble.paul Noble Paul added a comment -

           Scott Blum, there is no multi-read available in ZK even in the newer versions, right?

          noble.paul Noble Paul added a comment -

          However there is an asynchronous version of getData()

            /**
               * The asynchronous version of getData.
               *
               * @see #getData(String, Watcher, Stat)
               */
              public void getData(final String path, Watcher watcher,
                      DataCallback cb, Object ctx){
          }
          

           Should we run tests to compare whether it gives us an advantage? I guess it should.

           Another optimization is doing multiple deletes from the workQueue using the following method:

           /**
               * The asynchronous version of multi.
               *
               * @see #multi(Iterable)
               */
              public void multi(Iterable<Op> ops, MultiCallback cb, Object ctx) {
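
           A hedged sketch of how both ideas could look, assuming a connected ZooKeeper handle and a list of queue-node paths (this is not the actual Solr code): issue the getData calls asynchronously, and delete a processed batch of work-queue nodes with a single asynchronous multi().

             import java.util.ArrayList;
             import java.util.List;
             import java.util.concurrent.CountDownLatch;
             import org.apache.zookeeper.KeeperException;
             import org.apache.zookeeper.Op;
             import org.apache.zookeeper.ZooKeeper;

             final class AsyncZkSketch {

               // Fire off async reads for many queue items at once instead of reading them serially.
               static void readAll(ZooKeeper zk, List<String> paths) throws InterruptedException {
                 CountDownLatch done = new CountDownLatch(paths.size());
                 for (String path : paths) {
                   zk.getData(path, false, (rc, p, ctx, data, stat) -> {
                     if (KeeperException.Code.get(rc) == KeeperException.Code.OK) {
                       // hand "data" to the overseer's processing loop
                     }
                     done.countDown();
                   }, null);
                 }
                 done.await();
               }

               // Delete a whole batch of processed work-queue nodes with one asynchronous multi() call.
               static void deleteBatch(ZooKeeper zk, List<String> paths) {
                 List<Op> ops = new ArrayList<>();
                 for (String path : paths) {
                   ops.add(Op.delete(path, -1)); // -1 = any version
                 }
                 zk.multi(ops, (rc, p, ctx, results) -> {
                   // inspect rc/results; NoNode on an already-removed entry is benign
                 }, null);
               }
             }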
          
          cpoerschke Christine Poerschke added a comment -

           It seems that the cmd.collection == null check in maybeFlushBefore is needed; patch attached to reinstate/illustrate.

          erickerickson Erick Erickson added a comment -

           Shalin Shekhar Mangar Hmmm, maybe a better way to phrase it is that we process the queue in batches of up to 10,000? I didn't mean to convey that the ZooKeeper read was in batches that size, just that we process up to that many at once.

          shalinmangar Shalin Shekhar Mangar added a comment -

          Thanks Christine, please go ahead and commit your patch.

           joel.bernstein Joel Bernstein added a comment - edited

          I'm seeing the errors below when running the StreamExpressionTest. I suspect it's related to this ticket. I've been adding tests the past couple of days but only started seeing this today:

          Overseer main queue loop
          [junit4] 2> java.lang.NullPointerException
          [junit4] 2> 239409 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_0000000000) [n:127.0.0.1:51485_solr ] o.a.s.c.Overseer Exception in Overseer main queue loop
          [junit4] 2> java.lang.NullPointerException
          [junit4] 2> 239410 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_0000000000) [n:127.0.0.1:51485_solr ] o.a.s.c.Overseer Exception in Overseer main queue loop
          [junit4] 2> java.lang.NullPointerException
          [junit4] 2> 239411 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_0000000000) [n:127.0.0.1:51485_solr ] o.a.s.c.Overseer Exception in Overseer main queue loop
          [junit4] 2> java.lang.NullPointerException
          [junit4] 2> 239412 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_0000000000) [n:127.0.0.1:51485_solr ] o.a.s.c.Overseer Exception in Overseer main queue loop
          [junit4] 2> java.lang.NullPointerException
          [junit4] 2> 239413 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_0000000000) [n:127.0.0.1:51485_solr ] o.a.s.c.Overseer Exception in Overseer main queue loop
          [junit4] 2> java.lang.NullPointerException
          [junit4] 2> 239413 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_0000000000) [n:127.0.0.1:51485_solr ] o.a.s.c.Overseer Exception in Overseer main queue loop
          [junit4] 2> java.lang.NullPointerException
          [junit4] 2> 239414 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_0000000000) [n:127.0.0.1:51485_solr ] o.a.s.c.Overseer Exception in Overseer main queue loop
          [junit4] 2> java.lang.NullPointerException
          [junit4] 2> 239415 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_0000000000) [n:127.0.0.1:51485_solr ] o.a.s.c.Overseer Exception in Overseer main queue loop

          tomasflobbe Tomás Fernández Löbbe added a comment -

          It seems that the cmd.collection == null check in maybeFlushBefore is needed, patch attached to re-instate/illustrate.

           Yes, lots of tests are failing without this. ZkStateWriterTest fails if I apply Christine's patch though; some more changes are needed.

          joel.bernstein Joel Bernstein added a comment -

          Yeah, I just switched over to branch_6x and I'm not getting the failures there. Looks like it's this ticket.

          dragonsinth Scott Blum added a comment -

          ZkStateWriterTests.testZkStateWriterBatching() is written for exactly the behavior we wanted to change here. That test needs an overhaul. Patch forthcoming.

          dragonsinth Scott Blum added a comment -

           Fixes ZkStateWriterTest, etc.

          caomanhdat Cao Manh Dat added a comment -

           Looks good, +1 for commit.

          jira-bot ASF subversion and git services added a comment -

          Commit 972e342fee7a02e71300a9739b9971e63708589b in lucene-solr's branch refs/heads/master from Scott Blum
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=972e342 ]

          SOLR-10524: Build fix for NPE

          Introduced by ZkStateWriter batching optimizations.

          dragonsinth Scott Blum added a comment -

          Thanks Cao Manh Dat, committed to master.

          jira-bot ASF subversion and git services added a comment -

          Commit 9cab9c0cf2777a21a81386be2262e84da2bca751 in lucene-solr's branch refs/heads/branch_6x from Cao Manh Dat
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9cab9c0 ]

          SOLR-10524: Explore in-memory partitioning for processing Overseer queue messages

          jira-bot ASF subversion and git services added a comment -

          Commit 5c626dc9e0a488d43e9a2f41947fd2ec3b0b046f in lucene-solr's branch refs/heads/branch_6x from Scott Blum
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5c626dc ]

          SOLR-10524: Build fix for NPE

          Introduced by ZkStateWriter batching optimizations.


People

  • Assignee: caomanhdat Cao Manh Dat
  • Reporter: erickerickson Erick Erickson
  • Votes: 0
  • Watchers: 11
