Solr / SOLR-5681

Make the OverseerCollectionProcessor multi-threaded

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: SolrCloud
    • Labels: None

      Description

      Right now, the OverseerCollectionProcessor is single-threaded, i.e. submitting anything long-running blocks the processing of other mutually exclusive tasks.
      When OCP tasks become optionally async (SOLR-5477), it would be good to have truly non-blocking behavior by multi-threading the OCP itself.

      For example, a ShardSplit call on Collection1 would block the thread and thereby prevent processing of a create-collection task (which would stay queued in ZK), even though the two tasks are mutually exclusive.

      Here are a few of the challenges:

      • Mutual exclusivity: Only let mutually exclusive tasks run in parallel. An easy way to handle that is to only let 1 task per collection run at a time.
      • ZK Distributed Queue to feed tasks: The OCP consumes tasks from a queue. A task is only removed from the workQueue on completion so that, in case of a failure, the new Overseer can re-consume the same task and retry. A queue is not the right data structure for looking ahead, i.e. getting the 2nd task from the queue while the 1st one is in process. Also, deleting tasks which are not at the head of a queue is not really an 'intuitive' operation.

      Proposed solutions for task management:

      • Task funnel and peekAfter(): The parent thread is responsible for getting the request and passing it to a new thread (or one from the pool). The parent uses a peekAfter(last element) instead of a peek(); peekAfter returns the task after the 'last element'. Maintain this request information and use it for deleting/cleaning up the workQueue (a rough sketch follows the list).
      • Another (almost duplicate) queue: While offering tasks to the workQueue, also offer them to a new queue (call it volatileWorkQueue?). The difference is that, as soon as a task from this queue is picked up for processing by a thread, it is removed from that queue. At the end, the cleanup is done from the workQueue.
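
      As a purely illustrative sketch of the look-ahead idea, assuming a hypothetical peekAfter() and remove-anywhere operation (neither is an existing Solr/ZK queue API):

        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;

        // Hypothetical look-ahead queue: peekAfter(last) blocks until an element after
        // 'last' is available; remove() deletes a specific (possibly non-head) entry.
        interface LookAheadQueue<T> {
          T peekAfter(T last) throws InterruptedException;
          void remove(T element);
        }

        class TaskFunnel<T> {
          private final LookAheadQueue<T> workQueue;
          private final ExecutorService pool = Executors.newFixedThreadPool(10);

          TaskFunnel(LookAheadQueue<T> workQueue) { this.workQueue = workQueue; }

          void runLoop() throws InterruptedException {
            T last = null;
            while (!Thread.currentThread().isInterrupted()) {
              T task = workQueue.peekAfter(last); // look past tasks already handed out
              if (task == null) continue;
              last = task;
              pool.submit(() -> {
                process(task);
                workQueue.remove(task);           // clean up the workQueue only on completion
              });
            }
          }

          void process(T task) { /* collection-API processing would go here */ }
        }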
      Attachments

      1. SOLR-5681.patch (45 kB) - Anshum Gupta
      2. SOLR-5681.patch (44 kB) - Anshum Gupta
      3. SOLR-5681.patch (44 kB) - Anshum Gupta
      4. SOLR-5681.patch (44 kB) - Anshum Gupta
      5. SOLR-5681.patch (45 kB) - Anshum Gupta
      6. SOLR-5681.patch (46 kB) - Anshum Gupta
      7. SOLR-5681.patch (40 kB) - Anshum Gupta
      8. SOLR-5681.patch (36 kB) - Anshum Gupta
      9. SOLR-5681.patch (36 kB) - Anshum Gupta
      10. SOLR-5681.patch (37 kB) - Anshum Gupta
      11. SOLR-5681.patch (36 kB) - Anshum Gupta
      12. SOLR-5681.patch (35 kB) - Anshum Gupta
      13. SOLR-5681.patch (35 kB) - Anshum Gupta
      14. SOLR-5681.patch (37 kB) - Anshum Gupta
      15. SOLR-5681.patch (35 kB) - Anshum Gupta
      16. SOLR-5681.patch (30 kB) - Noble Paul
      17. SOLR-5681.patch (30 kB) - Anshum Gupta
      18. SOLR-5681.patch (23 kB) - Anshum Gupta
      19. SOLR-5681.patch (20 kB) - Anshum Gupta
      20. SOLR-5681-2.patch (75 kB) - Anshum Gupta
      21. SOLR-5681-2.patch (75 kB) - Anshum Gupta
      22. SOLR-5681-2.patch (75 kB) - Anshum Gupta
      23. SOLR-5681-2.patch (73 kB) - Anshum Gupta
      24. SOLR-5681-2.patch (73 kB) - Anshum Gupta
      25. SOLR-5681-2.patch (69 kB) - Anshum Gupta
      26. SOLR-5681-2.patch (68 kB) - Noble Paul
      27. SOLR-5681-2.patch (69 kB) - Noble Paul
      28. SOLR-5681-2.patch (66 kB) - Anshum Gupta
      29. SOLR-5681-2.patch (66 kB) - Anshum Gupta
      30. SOLR-5681-2.patch (66 kB) - Anshum Gupta
      31. SOLR-5681-2.patch (66 kB) - Anshum Gupta
      32. SOLR-5681-2.patch (64 kB) - Anshum Gupta
      33. SOLR-5681-2.patch (79 kB) - Anshum Gupta


          Activity

          Anshum Gupta added a comment -

          Async Collection API calls.

          Anshum Gupta added a comment -

          Here's another approach:
          Follow approach #2 with the following changes. When a request comes in, move it to the runningMap; right now it just gets put into the runningMap without being removed from the work queue.
          On completion, move the task from the running map to the failed/completed map. In case of an Overseer failure, whenever a new Overseer comes up, make sure that all tasks from the running queue are moved back to the submitted queue, and that they move back in practically the same order.

          Noble Paul added a comment -

          Although we call the thing a queue in ZK, ZK internally has no queue; you can read or delete an item from anywhere. Keep an in-memory list of tasks being processed. Read a few items in a batch (from the head, of course), decide what can be run now without interfering with other running tasks, and pass them over to a thread. When the tasks are completed, remove them from the in-memory list and delete them from ZK directly using the delete() command instead of queue.remove().

          Anshum Gupta added a comment -

          It's supposed to work in the following cases:

          • Calls are async
          • Only one call per collection would be processed at a given time (keeping the mutual exclusion logic simple for now).

          I'll also buffer the top-X submitted tasks from the ZK queue in memory and use internal, in-memory tracking to process tasks in parallel.

          Here's the flow that I'm working on now (a rough sketch follows the list):

          1. Read the top-X tasks from the ZK queue (as opposed to a single task now).
          2. Process tasks: spawn a thread/pass the task to a ThreadPoolExecutor in case it's an ASYNC request, else process inline (still debating). Synchronize on the task buffer and internal maps. Remove when done.
          3. Delete the task from ZK directly (not using the typical queue mechanism).
          4. Additionally, store more information about the completed ASYNC task in the ZK map entry.
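
          A minimal sketch of that flow, under the assumption of a simplified queue interface (peekTopN and deleteById here are illustrative signatures, not the actual DistributedQueue API):

            import java.util.List;
            import java.util.Set;
            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;

            class MultiThreadedProcessorSketch {
              interface Task { String collection(); String id(); }
              interface TaskQueue {
                List<Task> peekTopN(int n) throws InterruptedException; // look-ahead read from ZK
                void deleteById(String id);                             // direct ZK delete, not queue.remove()
              }

              // Collections that currently have a task in flight (one task per collection at a time).
              private final Set<String> collectionsInProgress = ConcurrentHashMap.newKeySet();
              private final ExecutorService pool = Executors.newFixedThreadPool(10);

              void runLoop(TaskQueue workQueue) throws InterruptedException {
                while (!Thread.currentThread().isInterrupted()) {
                  // A real implementation would wait when nothing new is available
                  // instead of spinning (see the discussion further down).
                  for (Task task : workQueue.peekTopN(10)) {
                    if (!collectionsInProgress.add(task.collection())) continue; // mutual exclusion
                    pool.submit(() -> {
                      try {
                        process(task);                   // the actual collection-API work
                      } finally {
                        collectionsInProgress.remove(task.collection());
                        workQueue.deleteById(task.id()); // remove from ZK only after completion
                      }
                    });
                  }
                }
              }

              void process(Task task) { /* ... */ }
            }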
          Anshum Gupta added a comment - edited

          First patch. I've ignored OverseerCollectionProcessorTest (still figuring out the changes required to mock the multi-threaded OverseerCollectionProcessor).
          Also, I have another patch almost ready, but it doesn't quite pass the tests yet, so I'll upload it once that's fixed.
          Here's what can be expected in the next (few) patches:

          1. Configurable ThreadPool size.
           2. Fix hard-coded logic (wherever that's the case).
          3. Mocked out multi-threaded OCP test.
           4. Tests for:
            1. Starting a long running task and having a shorter task triggered after that (with an expectation of seeing that complete before the other one).
            2. Test to validate that only mutually exclusive tasks run in parallel (sync + async combined).
          Noble Paul added a comment - edited

          Regarding workQueue.peekTopN(10): if a task is being processed, this will always return immediately with one item, and the loop would continue without a pause, hogging CPU and ZK traffic. You will need to ensure that the call only returns when the available items in the queue are different from the ones being processed.

          Anshum Gupta added a comment -

          Another patch with a few things fixed: got rid of some of the hard-coded logic and a probable multi-threading race condition.
          Also, the main thread loop now continues if there's nothing new in the work-queue.

          Anshum Gupta added a comment -

          Added a test for running parallel tasks (multiple collection creation and split). Seems like there's some issue fetching new tasks from the queue.
          Working on resolving the issue.

          Noble Paul added a comment - edited

          Fixed the OCPTest error.

          peekTopN() still does not take care of the problem I reported earlier

          Anshum Gupta added a comment -

          Updated patch, though it has some failing tests.

          Anshum Gupta added a comment -

          Another one. Still has some issues with:

          • Removal of tasks from work queue.
          • Failed tasks.

          Working on the above two.

          Anshum Gupta added a comment -

          A little more cleanup.

          Anshum Gupta added a comment -

          Patch with all but 1 test passing.

          Anshum Gupta added a comment -

          More tests. Working on adding more tests and fixing some issues found while adding another one.

          Noble Paul added a comment - edited

          I see a fundamental problem with the implementation.

          When the 'async' param is not passed, the main thread is blocked. This means that if I accidentally fire a long-running task w/o the async param, the OCP is locked up till it completes.

          The solution would be to have all tasks run in the thread pool; the purpose of the async param should then just be to choose whether or not to wait till the operation completes.

          Anshum Gupta added a comment -

          Changes to make it completely non-blocking. The only reason why a task would not be processed now is if there's another one mid-way for the same collection. I have some tests which fail intermittently, working on fixing those.

          Anshum Gupta added a comment -

          Made the completedTasks map a ConcurrentHashMap to handle the multi-threaded puts.

          Anshum Gupta added a comment -

          Updated patch. Handles the following:

          • An attempt is made to process all tasks in parallel. The only restriction is the collection name, i.e. at any point in time only one task per collection can be processed.
          • Restarts: the last task id is noted and, for tasks up to that id, an attempt is made to figure out whether the task was already completed, i.e. is present in the completed/failed ZK map (a rough sketch of that check follows).
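
          A hypothetical illustration of that restart check (the names here are illustrative; completedMap and failureMap stand in for the ZK-backed maps of finished tasks):

            import java.util.Map;

            class RestartCheckSketch {
              // A task found in either map finished (or failed) under the previous Overseer,
              // so it only needs its stale work-queue entry cleaned up, not a re-run.
              boolean alreadyProcessed(String asyncId,
                                       Map<String, ?> completedMap,
                                       Map<String, ?> failureMap) {
                return asyncId != null
                    && (completedMap.containsKey(asyncId) || failureMap.containsKey(asyncId));
              }
            }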

          It would be good if someone looked at it too.

          Has a few failing tests due to the last change. Working on fixing that.

          Anshum Gupta added a comment -

          Patch with a small change.

          Anshum Gupta added a comment -

          A restructured patch. The MultiThreadedOverseerCollectionProcessor now pretty much controls everything when it comes to executing tasks.
          The meaty processMessage is now part of the inner class. I need to fix the existing OCPTest for that; everything else passes fine.
          Also, here's the motivation behind moving everything to the inner class: the shardHandler is not really thread-safe, so sharing the same shardHandler across parallel tasks is not possible. I'm now passing the zkController from the Overseer to the OCP and using it in the inner class to get a new shardHandler for every new task that comes in. I've added a TODO in there about using something like a shardHandler pool for that.
          I'm testing this manually and it seems to be working. I still need to evaluate backwards compatibility for OCP.processMessage(), which was protected until now.

          Anshum Gupta added a comment - edited

          Another patch. This fixes an NPE issue in the DistributedQueue that was breaking the OverseerTest.
          I still need to fix OCPTest and OverseerStatusTest; everything else passes. Manual testing passes for everything I've tried so far, i.e. multiple parallel collection creations, shard splits, and deletions.

          Also, will add cleaning up of processedZKTasks in the next patch.

          Anshum Gupta added a comment -

          Cleaning up processedZKTasks in OCP.

          Anshum Gupta added a comment -

          Fixed the OVERSEERSTATUS (Stats) test and adapted it for the multi-threaded code.

          Only OverseerCollectionProcessorTest remains to be fixed (substitute processMessage?).

          Anshum Gupta added a comment -

          Another patch, with a different approach.
          This one involves creating a new shardHandler more often.

          Noble Paul added a comment -

          The failing tests are fixed

          Anshum Gupta added a comment -

          Thanks Noble.

          It would be good to get more eyes on this. Considering it touches important parts of the Collection API processing mechanics, the more the better, and the sooner the better too. I'd like to close this soon, or it might turn into something that's very tough to maintain.

          Noble Paul added a comment -

          Anshum Gupta, please put up a patch for trunk where all tests are passing, so that others can take a look.

          Anshum Gupta added a comment -

          This patch is cleaned up and passes tests.

          Shalin Shekhar Mangar added a comment -

          Some comments on the latest patch:

          1. The new createCollection method in CollectionAdminRequest is not required. In fact we should clean up the existing methods which hard code “implicit” router. I opened SOLR-6073 for it.
          2. There are some unrelated changes in CollectionHandler.handleRequestStatus()
           3. The added synchronisation in CoreAdminHandler.addTask is required. In fact it is a bug with the async work you did earlier and it should be fixed in trunk/branch_4x asap. We’re probably too late for it to make it into 4.8.1 but we should still try.
          4. The DistributedMap.size() method needlessly fetches all children. It can be implemented more efficiently using:
            Stat stat = new Stat();
            zookeeper.getData(dir, null, stat, true);
            stat.getNumChildren();
          5. The ‘excludeList’ param in DistributedQueue.peekTopN should be named ‘excludeSet’.
          6. DistributedQueue.peekTopN has the following code. It checks for topN.isEmpty but it should actually check for orderedChildren.isEmpty instead. Otherwise the method will return null even if children were found in the second pass after waiting.
             if (waitedEnough) {
               if (topN.isEmpty()) return null;
             }

          7. DistributedQueue.peekTopN has the following. Here the counter should be incremented only after topN.add(queueEvent) otherwise it either returns less nodes than requested and available or it waits more than required. For example, suppose children are (1,2,3,4,5), n=2 and excludeList=(1,2) then an extra await is invoked or if excludeList=(1,3) then only 2 is returned. In fact I think we should remove counter and just use topN.size() in the if condition. Also, is there any chance that headNode may be null?
             for (String headNode : orderedChildren.values()) {
               if (headNode != null && counter++ < n) {
                 try {
                   String id = dir + "/" + headNode;
                   if (excludeList != null && excludeList.contains(id)) continue;
                   QueueEvent queueEvent = new QueueEvent(id,
                       zookeeper.getData(dir + "/" + headNode, null, null, true), null);
                   topN.add(queueEvent);
                 } catch (KeeperException.NoNodeException e) {
                   // Another client removed the node first, try next
                 }
               } else {
                 if (topN.size() >= 1) {
                   return topN;
                 }
               }
             }
             if (topN.size() >= 1) {
               return topN;
             } else {
               childWatcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait);
               waitedEnough = wait != Long.MAX_VALUE;
               continue;
             }
            
          8. The DistributedQueue.peekTopN method catches and swallows the InterruptedException. We should just declare that it throws InterruptedException and let the caller deal with it.
          9. Remove the e.printStackTrace() calls in DistributedQueue.getLastElementId()
          10. Do not swallow InterruptedException in DistributedQueue.getLastElementId()
          11. overseerCollectionProcessor.shutdown(); in Overseer.close() is not required because that is done by ccThread.close() already
          12. There are formatting errors in success, error, time and storeFailureDetails methods in Overseer.Stats
           13. If maxParallelThreads is supposed to be a constant then it should be renamed accordingly to MAX_PARALLEL_THREADS.
          14. The maxParallelThreads=10 is not actually used while creating the ThreadPoolExecutor. Instead it is initialised with 5-100 threads!
          15. Use this.processedZKTasks = Collections.synchronizedSet(new HashSet<String>()); to remove the unchecked cast warning in OCP constructor.
          16. Instead of passing a shardHandler to OCP constructor, why not just pass a shardHandlerFactory?
          17. Remove the e.printStackTrace in catch clauses in OCP.run()
          18. Do not swallow InterruptedException in OCP.run()
          19. In OCP.cleanupWorkQueue, the synchronization on a ConcurrentHashMap is not required
          20. What is the reason behind cleaning work queue twice and sleeping for 20ms in this code:
             cleanUpWorkQueue();

             while (runningTasks.size() > maxParallelThreads) {
               Thread.sleep(20);
             }

             cleanUpWorkQueue();

          21. There are unrelated changes in OCP.prioritizeOverseerNodes
          22. There are formatting problems in run(), checkExclusivity and cleanUpWorkQueue methods in OCP.
          23. We should check for asyncId != null in if (completedMap.contains(asyncId) || failureMap.contains(asyncId)) to avoid two unnecessary calls to ZK.
          24. KeeperException.NodeExistsException thrown from markTaskAsRunning is ignored - Why would that happen? If it happens, why is it okay to ignore it? Shouldn’t we fail loudly or log a warning?
          Anshum Gupta added a comment -

          Thanks for looking at that Shalin. I've addressed everything but #6 and #7 (working on it right now). Everything other than that is either in the patch or explained below (or both).

          There are some unrelated changes in CollectionHandler.handleRequestStatus()

          They are related. The request-status message flow earlier assumed that the same request would never be in both the workQueue and the completed/failed map; that changes with this patch. The entry in the workQueue is only removed by the parent thread, in a batched manner, while the entry in the completed/failed map is made by the task thread. We now need to check for the task in the completed/failed map before we look up the running map/workQueue, hence the change.

          The added synchronisation in CoreAdminHandler.addTask is required.

          Created SOLR-6075 and put up a patch there. I wanted to commit it but am having trouble logging in, so I'll wait for my password to be reset or hope someone else commits it.

          DistributedQueue.peekTopN has the following code. It checks for topN.isEmpty but it should actually check for orderedChildren.isEmpty instead.

          If ordered children aren’t empty but topN is, we want to return null. So the check.

          Also, is there any chance that headNode may be null?

          I looked at the existing code and it does not assume that headNode would never be null. That’s the reason I added it here too.

          Remove the e.printStackTrace() calls in DistributedQueue.getLastElementId()

          It seems like you’ve been looking at an older patch. This method was renamed to getTailId() and the issues you mention have already been addressed.

          Instead of passing a shardHandler to OCP constructor, why not just pass a shardHandlerFactory?

          That required deeper changes to the mocks etc. We could move to that later, but for now I think this makes sense.

          In OCP.cleanupWorkQueue, the synchronization on a ConcurrentHashMap is not required

          It’s required because this is a read-process-update issue: we iterate over the key set of the map and then clear it at the end, while another thread might add to it in between. We don’t want to clear a completed task which was never removed from the ZK workQueue (see the sketch below).
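
          A minimal illustration of that race, assuming placeholder names (completedTasks and workQueueIds stand in for the actual fields):

            import java.util.Map;
            import java.util.Set;

            class CleanupSketch {
              // Even on a concurrent map, iterate-then-clear is a compound operation:
              // a task added by a worker thread between the loop and clear() would be
              // dropped from the map before its work-queue entry was removed. The lock
              // makes the sequence atomic with respect to writers that synchronize on
              // the same monitor.
              void cleanUpWorkQueue(Map<String, Object> completedTasks, Set<String> workQueueIds) {
                synchronized (completedTasks) {
                  for (String id : completedTasks.keySet()) {
                    workQueueIds.remove(id);   // remove the corresponding work-queue entry
                  }
                  completedTasks.clear();
                }
              }
            }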

          What is the reason behind cleaning work queue twice and sleeping for 20ms in this code:

          To maintain concurrency limits and to clean up from ZK after tasks complete. I’ve made it a little better by adding a waited bool in that loop and only calling cleanUp the second time when waited is set.

          There are unrelated changes in OCP.prioritizeOverseerNodes

          A merge issue, or at least it seems like one.

          KeeperException.NodeExistsException thrown from markTaskAsRunning is ignored

          This should never happen. It’s just that DistributedMap.put throws that exception, so we need to catch it. I’ve added a log.error for it and also a comment saying it should never happen.

          Anshum Gupta added a comment -

          Addressing #6.

          I realized I had already handled #7 in the last patch.

          Noble Paul added a comment - edited

          There are unrelated changes in OCP.prioritizeOverseerNodes

          I made those changes. I would commit it to trunk anyway because this logging was unnecessary; you can remove it.

          Noble Paul added a comment - edited

          OCP:

          • markTaskAsRunning(): make all those synchronized collection objects final.
          • Why is there a private variable shardHandlerOCP? We should consistently create a new object every time and not keep any shardHandler around.

          DistributedQueue.peekTopN():

          • I don’t think that method needs to catch InterruptedException.
          • The second param is a Set; don’t call it a List.
          • Returning both a List and null from DistributedQueue.peekTopN is not required. Always return a non-null List; then there's no need for unnecessary null checks.
          Anshum Gupta added a comment -

          Noble Paul, thanks for taking a look at it, but it seems like you looked at an older patch.
          DistributedQueue.peekTopN issues are already fixed in the latest patch. I'll make changes to never return a null from peekTopN.

          I'll also make the markTaskAsRunning objects final.
          shardHandlerOCP is also already removed in the last patch.

          Anshum Gupta added a comment -

          Patch that makes a few vars final (tracking related).
          Also, completedTasks is no longer a synchronizedHashMap but the synchronization is self-managed.

          Anshum Gupta added a comment -

          Made 'stale' in OCP a local variable and documented the use of the variable in the code.

          Shalin Shekhar Mangar added a comment -

          Thanks Anshum. Comments on your last patch:

           1. The processedZKTasks, runningTasks and collectionWip are wrapped with Collections.synchronizedSet and yet they are again synchronised before usage. One of the two synchronisations should be removed.
           2. MultiThreadedOverseerCollectionProcessor.updateStats has a todo about removing the synchronisation, which is correct. That synchronisation is not required here.
           3. MultiThreadedOverseerCollectionProcessor is just a Runnable and not really multi-threaded by itself. We should rename it to something more suitable, e.g. CollectionActionRunner, OverseerCollectionProcessorThread, or maybe just Runner.
          4. Any exception thrown during processMessage inside MultiThreadedOverseerCollectionProcessor is never logged. In fact if an exception is thrown by processMessage, the “response” will be null and it will cause NPE at head.setBytes(SolrResponse.serializable(response)); We should log such exceptions at the very minimum.
          5. There is a behaviour change between the earlier OCP and the multi-threaded one. If a processMessage results in an exception, the previous OCP will not remove the queue entry and continue to retry the task but the multi threaded OCP will mark the task as complete and remove it from the queue.
          6. The stats collection in MultiThreadedOverseerCollectionProcessor.run is not correct, the right way to time the operation is to use timerContext.stop() in a finally block right after processMessage
          7. There’s an empty catch (KeeperException e) block in MultiThreadedOverseerCollectionProcessor.run()
          8. Wrong code formatting in OCP.close()
           9. OCP.close() should use a shutDown(), awaitTermination(), shutdownNow() call pattern (see the sketch after this list). The shutdown method by itself will not interrupt tasks in progress.
          10. There are still unrelated changes in OCP.prioritizeOverseerNodes
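
          For reference, the call pattern mentioned in point 9 is the standard java.util.concurrent.ExecutorService shutdown idiom; a sketch, with the 60-second grace period being an arbitrary choice:

            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.TimeUnit;

            class ShutdownSketch {
              // Stop accepting new tasks, give running ones a bounded time to finish,
              // then interrupt whatever is still in progress.
              void close(ExecutorService tpe) {
                tpe.shutdown();
                try {
                  if (!tpe.awaitTermination(60, TimeUnit.SECONDS)) {
                    tpe.shutdownNow();
                  }
                } catch (InterruptedException e) {
                  tpe.shutdownNow();
                  Thread.currentThread().interrupt();   // preserve the interrupt status
                }
              }
            }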

          I'm still reviewing the changes.

          Shalin Shekhar Mangar added a comment -

          More comments:

          1. DistributedQueue.peekTopN should count stats in the same way as peek() does by using “peekN_wait_forever” and “peekN_wait_” + wait.
          2. DistributedQueue.peekTopN is still not correct. Suppose orderedChildren returns 0 nodes, the childWatcher.await will be called, thread will wait and immediately return 0 results even if children were available. So there was no point in waiting at all if we were going to return 0 results.
          3. The same thing happens later in DQ.peekTopN after the loop. There’s no point in calling await if we’re going to return null anyway.
             childWatcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait);
             waitedEnough = wait != Long.MAX_VALUE;
             if (waitedEnough) {
               return null;
             }

          4. The DQ.getTailId (renamed from getLastElementId) still has an empty catch block for KeeperException.
          5. We should probably add unit test for the DQ.peekTopN method.
          Noble Paul added a comment -

          Minor changes:

          • use wait/notify instead of Thread.sleep() in overseer main thread
          • changed some variable names
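
          A brief illustration of the wait/notify idea, assuming a hypothetical capacity check in the main loop (lock, runningTasks and the timeout are placeholders, not the actual Overseer fields):

            class WaitNotifySketch {
              private final Object lock = new Object();
              private int runningTasks = 0;
              private static final int MAX_PARALLEL = 10;

              // Main loop: instead of polling with Thread.sleep(), wait on a monitor
              // until a worker signals that capacity has freed up.
              void waitForCapacity() throws InterruptedException {
                synchronized (lock) {
                  while (runningTasks >= MAX_PARALLEL) {
                    lock.wait(100);          // woken by notifyAll, or re-checks after 100 ms
                  }
                  runningTasks++;
                }
              }

              // Called by a worker thread when its task finishes.
              void onTaskFinished() {
                synchronized (lock) {
                  runningTasks--;
                  lock.notifyAll();          // wake the main loop waiting for capacity
                }
              }
            }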
          Noble Paul added a comment -

          Shalin Shekhar Mangar, your comments are taken care of in the distributed queue.

          Anshum Gupta added a comment -

          Patch that addresses all of the things you had recommended. Here's a summary of the changes:

          • Synchronized variables are fixed. They are explicitly handled now.
          • Renamed the inner class to Runner.
          • Failed-task behavior switched back to what it should be, i.e. the OCP retries.
          • Stats handling fixed.
          • Close is now handled gracefully. Also, there's a call to OCP.close in the finally block for the main OCP thread run().
          • The threadpool is no longer created in the constructor but in the run method.
          Anshum Gupta added a comment - edited

          Added a new test verifying that a short-running task (OVERSEERSTATUS) fired after a long-running SHARDSPLIT returns before the latter completes.

          Also, moved invokeCollectionApi() from the OverseerStatus test to the parent AbstractFullDistribZkTestBase, as that method is useful for other tests too.

          Anshum Gupta added a comment -

          Another patch, integrates the patch for SOLR-6075. Will remove this before committing once that goes into trunk.

          Anshum Gupta added a comment -

          I would like to move ahead with committing this patch if I don't receive any feedback soon.

          Shalin Shekhar Mangar added a comment -

          Thanks Anshum.

          1. The new createCollection method in CollectionAdminRequest is not required. In fact we should clean up the existing methods which hard code “implicit” router. I opened SOLR-6073 for it.

            This patch still has the new createCollection method in CollectionAdminRequest. Please remove that.

          2. Another patch, integrates the patch for SOLR-6075. Will remove this before committing once that goes into trunk.

            Okay.

          3. Let's remove this DEBUG instance. I see this in a lot of other tests but I cannot find where this instance is set to true. It's a bug in the other tests too and we should fix them. I opened SOLR-6090
            private static final boolean DEBUG = false;
            
          4. Should we rename processedZkTasks to runningZkTasks? They are 'processed' by the main OCP thread but they are still 'running' so it may give a wrong impression to someone reading the code.
          5. Let's document the purpose of each of the sets/maps we've introduced such as completedTasks, processedZkTasks, runningTasks, collectionWip as a code comment.
           6. I think we should use the return value of collectionWip.add(collectionName) as a fail-safe and throw an exception if it ever returns false.
          7. The OCP.Runner must call either markTaskComplete or resetTaskWithException upon exit otherwise we'll have items in queue which will never be processed and we'll never know why. It is not enough to call resetTaskWithException upon a KeeperException or InterruptedException only.
          8. Similar to above, we should have debug level logging on items in our various data structures before cleanUpWorkQueue, after cleanUpWorkQueue, before the peekTopN call and the items returned by the peekTopN method. Also we should log the item skipped by 'checkExclusivity' in debug level. Without this logging, it'd be almost impossible to debug problems in production.
           9. If maxParallelThreads is supposed to be a constant then it should be renamed accordingly to MAX_PARALLEL_THREADS.
             Let's make it a constant.
          10. We can improve MultiThreadedOCPTest.testTaskExclusivity by sending a shard split for shard1_0 as the third collection action.
          11. There are still formatting problems in Overseer.Stats.success, error, time methods.
          12. ZkStateReader has a new MAX_COLL_PROCESSOR_THREADS instance variable which is never used.
          Anshum Gupta added a comment -

          This patch still has the new createCollection method in CollectionAdminRequest. Please remove that.

          Done.

          private static final boolean DEBUG = false;
          

          Removed

          Should we rename processedZkTasks to runningZkTasks?

          Done.

          Let's document the purpose of each of the sets/maps we've introduced such as completedTasks, processedZkTasks, runningTasks, collectionWip as a code comment.

          Done

          I think we should use use the return value of collectionWip.add(collectionName) as a fail-safe and throw an exception if it ever returns false.

          There's no uncaught exception or an exit point where this might not be reset. It would only not be reset in case the Overseer itself goes down, but then there's nothing that stops a new Overseer from picking up a task that was not completely processed by an older Overseer. I don't think we need to check for that. Also, the only thread that adds/checks is the main thread.

          we should have debug level logging on items in our various data structures

          Done.

          We can improve MultiThreadedOCPTest.testTaskExclusivity by sending a shard split for shard1_0 as the third collection action

          Firing async calls doesn't guarantee the order of task execution. Sending a SPLIT for shard1, followed by one for shard1_0, might lead to a failed test if the split for shard1_0 gets picked up before shard1 is split.

          There are still formatting problems in Overseer.Stats.success, error, time methods

          Weird, as I don't see any in my IDE. Hopefully this patch has it right.

          I’m working on getting maxParallelThreads to come from cluster props, i.e. making it configurable.

          Anshum Gupta added a comment - edited

          A patch with a fix for a potential NPE in a log.debug message.

          Also added a check for successful completion of split shard tasks.

          Shalin Shekhar Mangar added a comment -

          Firing async calls doesn't guarantee the order of task execution. Sending a SPLIT for shard1, followed by one for shard1_0, might lead to a failed test if the split for shard1_0 gets picked up before shard1 has been split.

          Ah, of course, you're right. How silly of me.

          Thanks for adding all the logging.

          The OCP.Runner must call either markTaskComplete or resetTaskWithException upon exit otherwise we'll have items in queue which will never be processed and we'll never know why. It is not enough to call resetTaskWithException upon a KeeperException or InterruptedException only.

          This is still not addressed.

          I'm working on making maxParallelThreads come from cluster props, i.e. configurable.

          This is still pending.

          One thing that I didn't notice earlier: we shouldn't be cleaning up in OCP.run before we execute the prioritizeOverseerNode() method. All OCP operations must be done after the prioritize call.

          Anshum Gupta added a comment -

          A patch that handles all of that. I'm running the tests and the precommit.
          If everything passes, I'd want to commit this.

          It is not enough to call resetTaskWithException upon a KeeperException or InterruptedException only.

          Though I'm half inclined towards adding a catch-all exception clause (or an equivalent), I'll put it in there to be safe.
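          A minimal sketch of that cleanup pattern, assuming the Runner, markTaskComplete and resetTaskWithException names from the review; the method bodies are placeholders, not the committed code.

          class Runner implements Runnable {
            @Override
            public void run() {
              boolean success = false;
              try {
                processTask();            // the actual collection API work
                success = true;
              } catch (Throwable t) {     // catch-all, not just KeeperException/InterruptedException
                resetTaskWithException(t);
              } finally {
                if (success) {
                  markTaskComplete();     // removes the item from the work queue
                }
                // one of the two cleanup calls runs on every exit path,
                // so no task is left sitting in the queue without an explanation
              }
            }

            private void processTask() { /* ... */ }
            private void markTaskComplete() { /* ... */ }
            private void resetTaskWithException(Throwable t) { /* ... */ }
          }

          With this shape, every exit path either marks the task complete or resets it with the exception attached, which is what the review asks for.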

          maxParallelThreads come from cluster props i.e. configurable

          I'll continue working on this, but I don't think it should block us from getting this in. It required a bit of tweaking around in the mocks, so I'd vote for committing it and adding this after the commit.
          I've removed the variables for now, though.

          We shouldn't be cleaning up in OCP.run before we execute the prioritizeOverseerNode() method. All OCP operations must be done after the prioritize call.

          We are not. We just read the last element.

          Shalin Shekhar Mangar added a comment -

          Yeah, making maxParallelThreads configurable isn't required. Just make it a constant for now.

          We just read the last element.

          I misunderstood that. Thanks for clearing it up!

          I think this is good to go! My +1

          ASF subversion and git services added a comment -

          Commit 1596089 from Anshum Gupta in branch 'dev/trunk'
          [ https://svn.apache.org/r1596089 ]

          SOLR-5681: Make the OverseerCollectionProcessor multi-threaded

          Noble Paul added a comment -

          REQUESTSTATUS would not tell me what the output of the run was?
          So I'd miss that when async is used?

          Anshum Gupta added a comment -

          I had created SOLR-5886 for that.
          Will take that one up soon.

          ASF subversion and git services added a comment -

          Commit 1596379 from Anshum Gupta in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1596379 ]

          SOLR-5681: Make the OverseerCollectionProcessor multi-threaded, merge from trunk (r1596089)

          ASF subversion and git services added a comment -

          Commit 1597137 from Anshum Gupta in branch 'dev/trunk'
          [ https://svn.apache.org/r1597137 ]

          SOLR-5681: Add synchronization while printing tracking maps to fix a jenkins failure
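          That synchronization most likely concerns iterating the tracking maps while logging them: a Collections.synchronizedMap only guards individual calls, so iterating it while worker threads update it can throw ConcurrentModificationException unless the caller holds the map's monitor. A generic sketch of the idea, not the committed change:

          import java.util.Collections;
          import java.util.HashMap;
          import java.util.Map;

          class TrackingMapLogger {
            private final Map<String, String> runningTasks =
                Collections.synchronizedMap(new HashMap<String, String>());

            void logRunningTasks() {
              // manual synchronization is required while iterating a synchronizedMap
              synchronized (runningTasks) {
                for (Map.Entry<String, String> e : runningTasks.entrySet()) {
                  System.out.println("running task: " + e.getKey() + " -> " + e.getValue());
                }
              }
            }
          }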

          ASF subversion and git services added a comment -

          Commit 1597140 from Anshum Gupta in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1597140 ]

          SOLR-5681: Add synchronization while printing tracking maps to fix a jenkins failure (Merge from trunk)

          ASF subversion and git services added a comment -

          Commit 1597146 from Anshum Gupta in branch 'dev/trunk'
          [ https://svn.apache.org/r1597146 ]

          SOLR-5681: Fixing CHANGES.txt entry

          ASF subversion and git services added a comment -

          Commit 1597150 from Anshum Gupta in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1597150 ]

          SOLR-5681: Fixing CHANGES.txt entry

          ASF subversion and git services added a comment -

          Commit 1597160 from Anshum Gupta in branch 'dev/trunk'
          [ https://svn.apache.org/r1597160 ]

          SOLR-5681: Fixing CHANGES.txt entry, moving it into the Optimizations section

          ASF subversion and git services added a comment -

          Commit 1597161 from Anshum Gupta in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1597161 ]

          SOLR-5681: Fixing CHANGES.txt entry, moving it into the Optimizations section

          Mark Miller added a comment -

          Hmm... did this mess with error handling for the collections API somehow, Anshum Gupta?

          We throw errors like:
          throw new SolrException(ErrorCode.BAD_REQUEST, "collection already exists: " + collectionName);
          and
          throw new SolrException(ErrorCode.BAD_REQUEST, "No config set found to associate with the collection.");
          and other such things in OverseerCollectionMessageHandler.

          But now all this code runs in an executor, and the errors are simply logged in another thread. The client always just gets 'the response timed out'.

          I'll file a new bug, but I'm just looking for where it was introduced. We need some collections API error testing, I think.
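          The failure mode described here is the usual one when work is handed off to an executor: if the submitted task only logs its exception, the request thread never sees it and eventually times out. A sketch of one way to surface such errors, using illustrative names rather than the actual Solr classes:

          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;

          class AsyncTaskStatus {
            private final ExecutorService pool = Executors.newFixedThreadPool(4);
            private final Map<String, Throwable> failures = new ConcurrentHashMap<String, Throwable>();

            void submit(final String taskId, final Runnable work) {
              pool.execute(new Runnable() {
                @Override
                public void run() {
                  try {
                    work.run();
                  } catch (Throwable t) {
                    // record the failure where the request thread can find it,
                    // instead of only logging it on the worker thread
                    failures.put(taskId, t);
                  }
                }
              });
            }

            Throwable failureFor(String taskId) {
              return failures.get(taskId);
            }
          }

          A status call could then report failureFor(taskId) back to the client instead of letting the request time out.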

          Mark Miller added a comment -

          Scratch that, off in the weeds.


            People

            • Assignee:
              Anshum Gupta
            • Reporter:
              Anshum Gupta
            • Votes:
              0
            • Watchers:
              7