Solr / SOLR-8744

Overseer operations need more fine grained mutual exclusion

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 5.4.1
    • Fix Version/s: 6.1
    • Component/s: SolrCloud

      Description

SplitShard takes a mutex over the whole collection, but in practice this is a big scaling problem. Multiple split-shard operations could run at the same time, as long as different shards are being split. In practice those shards often reside on different machines, so there is no I/O bottleneck in such cases; only the mutex in the Overseer forces the operations to run serially.

      Given that a single split can take many minutes on a large collection, this is a bottleneck at scale.

Here is the proposed new design:

Various collection operations are performed by the Overseer, and they may need exclusive access at different levels. Each operation must define the access level at which it requires exclusivity. Access level is an enum:

      CLUSTER(0)
      COLLECTION(1)
      SHARD(2)
      REPLICA(3)
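The four levels above could be sketched as a Java enum whose numeric value doubles as the node's depth in the lock tree. This is a minimal illustration only; the `child()` helper is a hypothetical convenience, not part of the proposal.

```java
// Access levels from the description above; height is the node's depth
// in the lock tree (0 = root). The child() helper is hypothetical.
enum LockLevel {
    CLUSTER(0), COLLECTION(1), SHARD(2), REPLICA(3);

    final int height;

    LockLevel(int height) {
        this.height = height;
    }

    /** The next level down the tree, or null at the REPLICA leaf. */
    LockLevel child() {
        return this == REPLICA ? null : values()[height + 1];
    }
}
```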

      The Overseer node maintains a tree of these locks. The lock tree would look as follows. The tree can be created lazily as and when tasks come up.

      Legend: 
      C1, C2 -> Collections
      S1, S2 -> Shards 
      R1,R2,R3,R4 -> Replicas
      
      
                       Cluster
                      /       \
                     /         \         
                    C1          C2
                   / \         /   \     
                  /   \       /     \      
                 S1   S2      S1     S2
              R1, R2  R3, R4  R1, R2  R3, R4
      

      When the overseer receives a message, it tries to acquire the appropriate lock from the tree. For example, if an operation needs a lock at a Collection level and it needs to operate on Collection C1, the node C1 and all child nodes of C1 must be free.
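The lazily built tree and the "node plus all descendants must be free" rule could be sketched as below. Class and method names (LockNode, LockTree, tryLock) are illustrative, not necessarily the ones in the attached patches.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the lazily built lock tree described above.
class LockNode {
    final Map<String, LockNode> children = new HashMap<>();
    boolean locked;

    /** Children are created lazily, as and when tasks name them. */
    LockNode child(String name) {
        return children.computeIfAbsent(name, n -> new LockNode());
    }

    /** Free only if neither this node nor any descendant holds a lock. */
    boolean isFree() {
        if (locked) return false;
        for (LockNode c : children.values()) {
            if (!c.isFree()) return false;
        }
        return true;
    }
}

class LockTree {
    private final LockNode root = new LockNode();

    /** Walk from the root along the path (e.g. "C1", "S1"); fail if an
     *  ancestor is locked, then lock the target if its subtree is free. */
    boolean tryLock(String... path) {
        LockNode node = root;
        for (String name : path) {
            if (node.locked) return false; // an ancestor holds the lock
            node = node.child(name);
        }
        if (!node.isFree()) return false;  // target or a descendant is busy
        node.locked = true;
        return true;
    }

    void unlock(String... path) {
        LockNode node = root;
        for (String name : path) {
            node = node.child(name);
        }
        node.locked = false;
    }
}
```

With the tree from the legend, a collection-level `tryLock("C1")` would block a later shard-level `tryLock("C1", "S1")`, while `tryLock("C2", "S1")` on a different collection still succeeds.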

Lock acquisition logic

Each operation starts from the root of the tree (level 0, Cluster) and moves down according to the operation. After it reaches the right node, it checks whether all the children are free of locks. If it fails to acquire a lock, it remains in the work queue. A scheduler thread waits for notification from the current set of tasks: every task does a notify() on the scheduler thread's monitor. The scheduler thread then starts from the head of the queue and checks each task to see whether it can acquire the right lock; if yes, the task is executed, if not, it is left in the work queue.
When a new task arrives in the work queue, the scheduler thread wakes up and tries to schedule just that task.
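The scheduler loop described above could be sketched with plain monitor wait/notify as follows. All names here (Scheduler, Task, scanOnce) are illustrative assumptions, not Solr's actual classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Sketch of the scheduler thread: scan the queue head-to-tail, run
// whatever can lock, and sleep until a task completes or a new one arrives.
class Scheduler {

    /** A queued operation; returns true if it obtained its tree lock. */
    interface Task {
        boolean tryAcquireLock();
    }

    private final Deque<Task> workQueue = new ArrayDeque<>();

    /** New tasks wake the scheduler so it can try to run them at once. */
    synchronized void submit(Task t) {
        workQueue.add(t);
        notify();
    }

    /** Completing tasks call this; a released lock may unblock others. */
    synchronized void taskCompleted() {
        notify();
    }

    /** One head-to-tail pass: dequeue and return every task that could
     *  acquire its lock; the rest stay in the queue in order. */
    synchronized List<Task> scanOnce() {
        List<Task> runnable = new ArrayList<>();
        Iterator<Task> it = workQueue.iterator();
        while (it.hasNext()) {
            Task t = it.next();
            if (t.tryAcquireLock()) {
                it.remove();
                runnable.add(t);
            }
        }
        return runnable;
    }

    /** Scheduler thread body: scan, hand runnable tasks to an executor
     *  (elided here), then sleep until the next notify(). */
    synchronized void runLoop() throws InterruptedException {
        while (true) {
            scanOnce();
            wait();
        }
    }
}
```

A task whose tryAcquireLock() returns false is simply left in place, so it is retried on the next notification, matching the "remains in the work queue" behavior above.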

        Attachments

        1. SOLR-8744.patch
          42 kB
          Noble Paul
        2. SOLR-8744.patch
          45 kB
          Noble Paul
        3. SOLR-8744.patch
          39 kB
          Noble Paul
        4. SOLR-8744.patch
          39 kB
          Noble Paul
        5. SOLR-8744.patch
          11 kB
          Noble Paul
        6. SOLR-8744.patch
          16 kB
          Noble Paul
        7. SOLR-8744.patch
          15 kB
          Scott Blum
        8. SOLR-8744.patch
          15 kB
          Noble Paul
        9. SOLR-8744.patch
          17 kB
          Scott Blum
        10. SmileyLockTree.java
          4 kB
          David Smiley
        11. SmileyLockTree.java
          5 kB
          David Smiley


              People

• Assignee: Noble Paul (noble.paul)
• Reporter: Scott Blum (dragonsinth)
• Votes: 0
• Watchers: 8
