SOLR-10983: Fix DOWNNODE -> queue-work explosion

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.6.1, 7.0, master (8.0)
    • Component/s: SolrCloud
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels: None

      Description

      Every DOWNNODE command enqueues N copies of itself into queue-work, where N is the number of collections affected by the DOWNNODE.

      This rarely matters in practice, because queue-work is normally cleared immediately. However, if anything throws an exception (such as a ZK bad version), we don't clear queue-work. The next time through the loop, we then run the expensive DOWNNODE command potentially hundreds of times.
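      The mechanism can be illustrated with a toy model. This is a minimal sketch, not Solr's Overseer code: the class name, the `WriteCallback` interface, and `enqueueUpdate` are hypothetical stand-ins, and the only behaviour assumed is the one described above (one write command per affected collection, with the callback copying the message into queue-work on every enqueue).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy model of the reported behaviour; NOT Solr's actual Overseer code. */
class DownNodeExplosionSketch {

    /** Hypothetical stand-in for a per-enqueue callback. */
    interface WriteCallback {
        void onEnqueue();
    }

    /** Hypothetical stand-in for the state writer: fires the callback once per enqueue call. */
    static void enqueueUpdate(String writeCommand, WriteCallback callback) {
        callback.onEnqueue();
    }

    /**
     * Processes one DOWNNODE message affecting the given collections and
     * returns the contents of the modelled queue-work afterwards.
     */
    static Deque<String> processDownNode(List<String> affectedCollections) {
        Deque<String> queueWork = new ArrayDeque<>();
        String message = "DOWNNODE";
        for (String collection : affectedCollections) {
            // One write command per affected collection; each enqueue copies
            // the original message into queue-work again.
            enqueueUpdate("update:" + collection, () -> queueWork.add(message));
        }
        return queueWork;
    }

    public static void main(String[] args) {
        Deque<String> queueWork = processDownNode(Arrays.asList("c1", "c2", "c3"));
        System.out.println(queueWork.size() + " copies of DOWNNODE in queue-work");
    }
}
```

      With three affected collections, the modelled queue-work ends up holding three copies of the same message. If that queue is not cleared after a failure, each copy is replayed as a full DOWNNODE.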

        Activity

        dragonsinth Scott Blum added a comment - Shalin Shekhar Mangar Joshua Humphries
        shalinmangar Shalin Shekhar Mangar added a comment -

        Nice catch!

        Your patch solves another problem – today, if an exception happens, we run through the items in the work-queue and then the last item from state-update-queue (the one during which the exception happened), so we run the same item twice.

        Considering that DOWNNODE is the only command that enqueues multiple ZkWriteCommands, I think we should add a method to ZkStateWriter which calls enqueue only once for the entire batch. That and your patch solve all of the problems nicely, i.e.:

        1. DOWNNODE creating multiple work queue items
        2. Exceptions not clearing work queue
        3. Overseer executing same item twice from work queue and state update queue on an exception
        shalinmangar Shalin Shekhar Mangar added a comment -

        On second thought, creating a batch enqueue command is not so straightforward, and the callback is called once per enqueue as per the contract of ZkWriteCallback, so it is technically not a bug. So I am fine with your solution as it exists. +1 to commit. Please make sure it is backported to branch_7x and branch_7_0 so that it makes it into the 7.0 release.
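        If the callback keeps firing once per enqueue, the caller has to guard against duplicates itself. The sketch below shows one way that could look in a toy model. It is an assumption for illustration only and may differ from the committed patch; `WriteCallback`, `enqueueUpdate`, and the `alreadyQueued` flag are hypothetical names, not Solr APIs.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy model of a caller-side guard; NOT the committed Solr patch. */
class EnqueueOnceSketch {

    /** Hypothetical stand-in for a per-enqueue callback. */
    interface WriteCallback {
        void onEnqueue();
    }

    /** Hypothetical stand-in for the state writer: fires the callback once per enqueue call. */
    static void enqueueUpdate(String writeCommand, WriteCallback callback) {
        callback.onEnqueue();
    }

    /**
     * Processes one DOWNNODE message affecting the given collections and
     * returns the contents of the modelled queue-work afterwards.
     */
    static Deque<String> processDownNode(List<String> affectedCollections) {
        Deque<String> queueWork = new ArrayDeque<>();
        String message = "DOWNNODE";
        // Single-element array so the lambda can flip the flag.
        boolean[] alreadyQueued = {false};
        WriteCallback once = () -> {
            // The callback still fires per enqueue, but only the first
            // invocation copies the message into queue-work.
            if (!alreadyQueued[0]) {
                queueWork.add(message);
                alreadyQueued[0] = true;
            }
        };
        for (String collection : affectedCollections) {
            enqueueUpdate("update:" + collection, once);
        }
        return queueWork;
    }

    public static void main(String[] args) {
        Deque<String> queueWork = processDownNode(Arrays.asList("c1", "c2", "c3"));
        System.out.println(queueWork.size() + " copy of DOWNNODE in queue-work");
    }
}
```

        The guard leaves the once-per-enqueue contract untouched and keeps the fix local to the caller, which avoids introducing a batch enqueue method.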

        dragonsinth Scott Blum added a comment -

        Thanks! Will do

        jira-bot ASF subversion and git services added a comment -

        Commit 380eed838d6646ec02592a9d2e6649e6aa1b5d9b in lucene-solr's branch refs/heads/master from Scott Blum
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=380eed8 ]

        SOLR-10983: Fix DOWNNODE -> queue-work explosion

        jira-bot ASF subversion and git services added a comment -

        Commit 17245c2e5a93bca59572c09af78a6ad6045e75eb in lucene-solr's branch refs/heads/branch_7x from Scott Blum
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=17245c2 ]

        SOLR-10983: Fix DOWNNODE -> queue-work explosion

        jira-bot ASF subversion and git services added a comment -

        Commit 51638c09bf4f5457650ab40c60b5f98512f9ca1d in lucene-solr's branch refs/heads/branch_7_0 from Scott Blum
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=51638c0 ]

        SOLR-10983: Fix DOWNNODE -> queue-work explosion

        dragonsinth Scott Blum added a comment -

        BTW: this issue most likely affects all 6.x releases (and even some late 5.x), so it should be considered if we do any 6.x point releases later.

        jira-bot ASF subversion and git services added a comment -

        Commit d704796a785aa0d8e455661e519bb2f0c67b7311 in lucene-solr's branch refs/heads/branch_6x from Erick
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d704796 ]

        SOLR-10983: Fix DOWNNODE -> queue-work explosion, backporting to 6x as per the comments in the JIRA

        erickerickson Erick Erickson added a comment -

        I backported this to 6x (future 6.7) as I really expect there to be a final release of the 6x code line and didn't want this to be omitted. No harm if there's not a 6.7.

        varunthacker Varun Thacker added a comment -

        Re-opening to backport to 6.6.1

        jira-bot ASF subversion and git services added a comment -

        Commit f031a85f50902cfc0b54422b35f60effb7353b05 in lucene-solr's branch refs/heads/branch_6_6 from Scott Blum
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f031a85 ]

        SOLR-10983: Fix DOWNNODE -> queue-work explosion


          People

          • Assignee: dragonsinth Scott Blum
          • Reporter: dragonsinth Scott Blum
          • Votes: 0
          • Watchers: 6
