Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-9936

Allow configuration for recoveryExecutor thread pool size

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.3
    • Fix Version/s: 6.6, 7.0
    • Component/s: replication (java)
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None
    • Flags:
      Patch

      Description

      There are two executor services in UpdateShardHandler, the updateExecutor whose size is unbounded for reasons explained in the code comments. There is also the recoveryExecutor which was added later, and is the one that executes the RecoveryStrategy code to actually fetch index files and store to disk, eventually calling an fsync thread to ensure the data is written.

      We found that with a fast network such as 10GbE it's very easy to overload the local disk storage when doing a restart of Solr instances after some downtime, if they have many cores to load. Typically we have each physical server containing 6 SSDs and 6 Solr instances, so each Solr has its home dir on a dedicated SSD. With 100+ cores (shard replicas) on each instance, startup can really hammer the SSD as it's writing in parallel from as many cores as Solr is recovering. This made recovery time bad enough that replicas were down for a long time, and even shards marked as down if none of its replicas have recovered (usually when many machines have been restarted). The very slow IO times (10s of seconds or worse) also made the JVM pause, so that disconnects from ZK, which didn't help recovery either.

      This patch allowed us to throttle how much parallelism there would be writing to a disk - in practice we're using a pool size of 4 threads, to prevent the SSD getting overloaded, and that worked well enough to make recovery of all cores in reasonable time.

      Due to the comment on the other thread pool size, I'd like some comments on whether it's OK to do this for the recoveryExecutor though?

      It's configured in solr.xml with e.g.

        <updateshardhandler>
          <int name="maxRecoveryThreads">${solr.recovery.threads:4}</int>
        </updateshardhandler>
      
      1. SOLR-9936.patch
        7 kB
        Tim Owen
      2. SOLR-9936.patch
        7 kB
        Tim Owen

        Activity

        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        We would want to do some vetting and testing, but perhaps currently okay to limit the recovery executor.

        Show
        markrmiller@gmail.com Mark Miller added a comment - We would want to do some vetting and testing, but perhaps currently okay to limit the recovery executor.
        Hide
        TimOwen Tim Owen added a comment -

        Thanks - given the comment with the updateExecutor code and Yonik's reply in ticket SOLR-8205 I was wary of changing this, but I couldn't see a scenario where it could deadlock. Would certainly appreciate some further input from people who've worked on the recovery code e.g. Shalin Shekhar Mangar in SOLR-7280.

        Show
        TimOwen Tim Owen added a comment - Thanks - given the comment with the updateExecutor code and Yonik's reply in ticket SOLR-8205 I was wary of changing this, but I couldn't see a scenario where it could deadlock. Would certainly appreciate some further input from people who've worked on the recovery code e.g. Shalin Shekhar Mangar in SOLR-7280 .
        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        I'm sure we pulled recovery into its own executor to avoid deadlock issues. All the latest changes make it fine in theory, it's really just a matter of hammering the system to be sure.

        Show
        markrmiller@gmail.com Mark Miller added a comment - I'm sure we pulled recovery into its own executor to avoid deadlock issues. All the latest changes make it fine in theory, it's really just a matter of hammering the system to be sure.
        Hide
        TimOwen Tim Owen added a comment -

        Just uploaded a replacement patch that builds against the master branch (the previous one was a patch against 6.3 and wouldn't merge to master because of all the changes to metrics)

        Show
        TimOwen Tim Owen added a comment - Just uploaded a replacement patch that builds against the master branch (the previous one was a patch against 6.3 and wouldn't merge to master because of all the changes to metrics)
        Hide
        yjfeng001 Jerome Yang added a comment -

        Any update on this issue?

        Show
        yjfeng001 Jerome Yang added a comment - Any update on this issue?
        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        As long as we continue to use this executor in a similiar manner, there should not be any issue with this.

        Show
        markrmiller@gmail.com Mark Miller added a comment - As long as we continue to use this executor in a similiar manner, there should not be any issue with this.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit bc6ff493b09a1ec5454c5ce790f6b7ecb714743e in lucene-solr's branch refs/heads/master from markrmiller
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bc6ff49 ]

        SOLR-9936: Allow configuration for recoveryExecutor thread pool size.

        Show
        jira-bot ASF subversion and git services added a comment - Commit bc6ff493b09a1ec5454c5ce790f6b7ecb714743e in lucene-solr's branch refs/heads/master from markrmiller [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bc6ff49 ] SOLR-9936 : Allow configuration for recoveryExecutor thread pool size.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 5116968babb9ea1cd283574beff3e3e83c397c97 in lucene-solr's branch refs/heads/branch_6x from markrmiller
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5116968 ]

        SOLR-9936: Allow configuration for recoveryExecutor thread pool size.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 5116968babb9ea1cd283574beff3e3e83c397c97 in lucene-solr's branch refs/heads/branch_6x from markrmiller [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5116968 ] SOLR-9936 : Allow configuration for recoveryExecutor thread pool size.
        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        Thanks Tim!

        Show
        markrmiller@gmail.com Mark Miller added a comment - Thanks Tim!

          People

          • Assignee:
            markrmiller@gmail.com Mark Miller
            Reporter:
            TimOwen Tim Owen
          • Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development