Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-9264

Optimize ZkController.publishAndWaitForDownStates

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.2, 7.0
    • Component/s: SolrCloud
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      ZkController.publishAndWaitForDownStates keeps looping over all collections in the cluster state to ensure that every replica hosted on the current node has been marked as down. This is wasteful when you have a large number of collections because each access to a non-watched collection gets data from ZK. Instead, we can watch the interesting collections (i.e. which have replicas hosted locally) and wait till we see the required state.

      1. SOLR-9264.patch
        5 kB
        Shalin Shekhar Mangar
      2. SOLR-9264.patch
        5 kB
        Shalin Shekhar Mangar
      3. SOLR-9264.patch
        4 kB
        Shalin Shekhar Mangar

        Activity

        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Patch which uses ZkStateReader's registerCollectionStateWatcher to wait until all replicas are in 'down' state.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Patch which uses ZkStateReader's registerCollectionStateWatcher to wait until all replicas are in 'down' state.
        Hide
        hgadre Hrishikesh Gadre added a comment -

        [~ shalinmangar] I think the patch looks good. Only couple of minor comments,

        • Can we rename the "interestingCollections" and "interestingCollection" variable to something like "collectionsWithLocalReplica" and "collectionWithLocalReplica" ? It is a little difficult to understand what "interesting" means in this context without reading the JIRA description...
        • Is it possible for the callback to be delivered more than once? If yes then we should probably add some defensive check before invoking the countDown method on the latch.
        Show
        hgadre Hrishikesh Gadre added a comment - [~ shalinmangar] I think the patch looks good. Only couple of minor comments, Can we rename the "interestingCollections" and "interestingCollection" variable to something like "collectionsWithLocalReplica" and "collectionWithLocalReplica" ? It is a little difficult to understand what "interesting" means in this context without reading the JIRA description... Is it possible for the callback to be delivered more than once? If yes then we should probably add some defensive check before invoking the countDown method on the latch.
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Thanks Hrishikesh. This patch incorporates both of your review comments. I'll commit this shortly.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Thanks Hrishikesh. This patch incorporates both of your review comments. I'll commit this shortly.
        Hide
        hgadre Hrishikesh Gadre added a comment -

        [~ shalinmangar] Thanks for the updated patch.

        It seems to me that the logic based on AtomicBoolean is probably not sufficient if the callback is invoked multiple times sequentially for the same collection since this variable is in the local scope. Is it a possibility? I think instead of AtomicBoolean we should use a concurrent hashmap (outside the scope of the lambda expression). This map should be pre-populated with the collection names before registering the callback. We can even reuse the collectionsWithLocalReplica variable for this purpose (i.e. instead of Set, we will use ConcurrentHashMap type).

        Inside the callback we can use the remove method in a similar fashion to compareAndSet.

        i.e. instead of counted.compareAndSet(false, true)
        do collectionsWithLocalReplica.remove(collectionName) != null

        What do you think?

        Show
        hgadre Hrishikesh Gadre added a comment - [~ shalinmangar] Thanks for the updated patch. It seems to me that the logic based on AtomicBoolean is probably not sufficient if the callback is invoked multiple times sequentially for the same collection since this variable is in the local scope. Is it a possibility? I think instead of AtomicBoolean we should use a concurrent hashmap (outside the scope of the lambda expression). This map should be pre-populated with the collection names before registering the callback. We can even reuse the collectionsWithLocalReplica variable for this purpose (i.e. instead of Set, we will use ConcurrentHashMap type). Inside the callback we can use the remove method in a similar fashion to compareAndSet. i.e. instead of counted.compareAndSet(false, true) do collectionsWithLocalReplica.remove(collectionName) != null What do you think?
        Hide
        hgadre Hrishikesh Gadre added a comment -

        since this variable is in the local scope.

        OK my bad. I see that the AtomicBoolean is indeed outside the scope of lambda expression. Now I don't quite understand how does this fix the concurrency issue. Don't we need separate AtomicBoolean per collection ?

        Show
        hgadre Hrishikesh Gadre added a comment - since this variable is in the local scope. OK my bad. I see that the AtomicBoolean is indeed outside the scope of lambda expression. Now I don't quite understand how does this fix the concurrency issue. Don't we need separate AtomicBoolean per collection ?
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Now I don't quite understand how does this fix the concurrency issue. Don't we need separate AtomicBoolean per collection ?

        My last patch did have separate AtomicBoolean per collection because they were created inside the loop. However, I can see how this seems more complicated so I have changed the collectionsWithLocalReplica to be a concurrent hash set and we remove the collection from it to ensure that the latch isn't counted down more than once for the same collection.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Now I don't quite understand how does this fix the concurrency issue. Don't we need separate AtomicBoolean per collection ? My last patch did have separate AtomicBoolean per collection because they were created inside the loop. However, I can see how this seems more complicated so I have changed the collectionsWithLocalReplica to be a concurrent hash set and we remove the collection from it to ensure that the latch isn't counted down more than once for the same collection.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 015e0fc1cf1d581c9657cd8f5588062c02588793 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=015e0fc ]

        SOLR-9264: Optimize ZkController.publishAndWaitForDownStates to not read all collection states and watch relevant collections instead

        Show
        jira-bot ASF subversion and git services added a comment - Commit 015e0fc1cf1d581c9657cd8f5588062c02588793 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=015e0fc ] SOLR-9264 : Optimize ZkController.publishAndWaitForDownStates to not read all collection states and watch relevant collections instead
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 8cb37842ec531a84469607971024336b68c6ed50 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8cb3784 ]

        SOLR-9264: Optimize ZkController.publishAndWaitForDownStates to not read all collection states and watch relevant collections instead
        (cherry picked from commit 015e0fc)

        Show
        jira-bot ASF subversion and git services added a comment - Commit 8cb37842ec531a84469607971024336b68c6ed50 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8cb3784 ] SOLR-9264 : Optimize ZkController.publishAndWaitForDownStates to not read all collection states and watch relevant collections instead (cherry picked from commit 015e0fc)
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Thanks for the review Hrishikesh!

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Thanks for the review Hrishikesh!
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 74c86063cf94dcc4dc022776bba31ae278686b42 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=74c8606 ]

        SOLR-9264: Remove unused imports

        Show
        jira-bot ASF subversion and git services added a comment - Commit 74c86063cf94dcc4dc022776bba31ae278686b42 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=74c8606 ] SOLR-9264 : Remove unused imports
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 1ce09b482e9370649e9a7421b4961a67e744e46f in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1ce09b4 ]

        SOLR-9264: Remove unused imports
        (cherry picked from commit 74c8606)

        Show
        jira-bot ASF subversion and git services added a comment - Commit 1ce09b482e9370649e9a7421b4961a67e744e46f in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1ce09b4 ] SOLR-9264 : Remove unused imports (cherry picked from commit 74c8606)
        Hide
        mikemccand Michael McCandless added a comment -

        Bulk close resolved issues after 6.2.0 release.

        Show
        mikemccand Michael McCandless added a comment - Bulk close resolved issues after 6.2.0 release.

          People

          • Assignee:
            shalinmangar Shalin Shekhar Mangar
            Reporter:
            shalinmangar Shalin Shekhar Mangar
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development