Lucene - Core
  LUCENE-6119

Add auto-io-throttle to ConcurrentMergeScheduler

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      This method returns the number of "incoming" bytes IW has written since
      it was opened, excluding merging.

      It tracks flushed segments, new commits (segments_N), incoming
      files/segments from addIndexes, and newly written live docs / doc
      values update files.

      It's an easy statistic for IW to track and should be useful to help
      applications more intelligently set defaults for IO throttling
      (RateLimiter).

      For example, an application that does hardly any indexing but
      eventually triggers a large merge can afford to heavily throttle that
      merge so it won't interfere with ongoing searches.

      But an application that's causing IW to write new bytes at 50 MB/sec
      must set a correspondingly higher IO throttle, or merges will clearly
      fall behind.
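
      As a hedged sketch (the issue describes the statistic but doesn't name
      the accessor, so the method below is hypothetical), an application
      could sample this counter periodically and derive a merge throttle
      from the observed ingest rate:

        import org.apache.lucene.index.IndexWriter;

        class IngestRateSampler {
          /** Samples IW's "incoming bytes" statistic over windowMillis and returns MB/sec. */
          static double sampleIngestMBPerSec(IndexWriter writer, long windowMillis)
              throws InterruptedException {
            long before = bytesWrittenExcludingMerges(writer);
            Thread.sleep(windowMillis);
            long after = bytesWrittenExcludingMerges(writer);
            return (after - before) / (1024.0 * 1024.0) / (windowMillis / 1000.0);
          }

          /** Placeholder for the proposed statistic; the real accessor is not named in this issue. */
          static long bytesWrittenExcludingMerges(IndexWriter writer) {
            throw new UnsupportedOperationException("hypothetical API");
          }
        }

      The application could then set its RateLimiter to some multiple of the
      returned rate (e.g. 3x) so merges comfortably keep up with ingestion.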

      Attachments

      1. LUCENE-6119.patch
        143 kB
        Michael McCandless
      2. LUCENE-6119.patch
        143 kB
        Michael McCandless
      3. LUCENE-6119.patch
        124 kB
        Michael McCandless
      4. LUCENE-6119.patch
        66 kB
        Michael McCandless
      5. LUCENE-6119.patch
        35 kB
        Michael McCandless
      6. LUCENE-6119.patch
        12 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Simple patch + test.

        Michael McCandless added a comment -

        Thinking about this more ... it may be better to do this entirely inside a FilterDirectory.

        E.g. when an IndexOutput is closed and its IOContext is not MERGE, increment the bytes-written counter ... and then that same directory instance could dynamically update the target merge throttling ... maybe.
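
        A minimal sketch of that idea, assuming Lucene 5.x store APIs (this
        is not the committed code): a FilterDirectory that counts bytes
        written by non-merge outputs when they are closed.

          import java.io.IOException;
          import java.util.concurrent.atomic.AtomicLong;
          import org.apache.lucene.store.Directory;
          import org.apache.lucene.store.FilterDirectory;
          import org.apache.lucene.store.IOContext;
          import org.apache.lucene.store.IndexOutput;

          class NonMergeByteTrackingDirectory extends FilterDirectory {
            final AtomicLong nonMergeBytes = new AtomicLong();

            NonMergeByteTrackingDirectory(Directory in) {
              super(in);
            }

            @Override
            public IndexOutput createOutput(String name, IOContext context) throws IOException {
              final IndexOutput out = in.createOutput(name, context);
              if (context.context == IOContext.Context.MERGE) {
                return out; // merge writes are excluded from the statistic
              }
              return new IndexOutput("tracking(" + out + ")") {
                @Override public void writeByte(byte b) throws IOException { out.writeByte(b); }
                @Override public void writeBytes(byte[] b, int off, int len) throws IOException { out.writeBytes(b, off, len); }
                @Override public long getFilePointer() { return out.getFilePointer(); }
                @Override public long getChecksum() throws IOException { return out.getChecksum(); }
                @Override public void close() throws IOException {
                  // on close, fold this file's final size into the running total
                  nonMergeBytes.addAndGet(out.getFilePointer());
                  out.close();
                }
              };
            }
          }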

        Michael McCandless added a comment -

        OK new patch with a completely different approach, moving the
        tracking under Directory.

        I added a new AdaptiveRateLimitedDirectoryWrapper: it watches the
        average bytes/sec written by non-merges, and then (based on a
        multiplier) sets the merge throttling accordingly. It uses a rolling
        window of timestamps covering the last 1 GB of writes at 1 MB resolution, and
        lets you set min/max on the merge throttle.

        Also, I removed RateLimitedDirectoryWrapper: I think it's dangerous
        how it encourages you to throttle anything except merges, and
        encourages you to just set a fixed rate (one size does NOT fit
        all...). If you want "similar" behavior you can use
        AdaptiveRateLimitedDirectoryWrapper but set min and max to the same
        value.
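
        A sketch of the rolling window just described (details assumed, not
        taken from the attached patch): record a timestamp for every 1 MB
        written, keep the last 1024 stamps (= 1 GB), and derive the recent
        non-merge write rate from the oldest retained stamp.

          class RollingWriteRate {
            private static final int MB = 1024 * 1024;
            private final long[] stampsNS = new long[1024]; // 1 GB window at 1 MB resolution
            private int count;         // stamps recorded so far, capped at the array length
            private int next;          // ring-buffer write position
            private long pendingBytes; // bytes accumulated since the last stamp

            synchronized void bytesWritten(long numBytes) {
              pendingBytes += numBytes;
              while (pendingBytes >= MB) {
                pendingBytes -= MB;
                stampsNS[next] = System.nanoTime();
                next = (next + 1) % stampsNS.length;
                if (count < stampsNS.length) {
                  count++;
                }
              }
            }

            /** Average MB/sec across the retained window; 0 if there is too little data. */
            synchronized double mbPerSec() {
              if (count < 2) {
                return 0;
              }
              int oldest = (next - count + stampsNS.length) % stampsNS.length;
              int newest = (next - 1 + stampsNS.length) % stampsNS.length;
              double sec = (stampsNS[newest] - stampsNS[oldest]) / 1e9;
              return sec <= 0 ? 0 : (count - 1) / sec; // count-1 MB elapsed between the two stamps
            }
          }

        The wrapper would then set the merge throttle to something like
        clamp(multiplier * mbPerSec(), min, max).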

        Michael McCandless added a comment -

        I ran some tests with this approach and I think it's no good.

        This creates a tricky feedback system, where both CMS (via hard stalling of incoming threads) and this directory attempt to make changes to let merges catch up. When CMS's hard stalls kick in, they lower the indexing bytes/sec rate, which causes this directory to (over-simplistically) lower the merge IO throttling, which causes the merges to take longer.

        I think it's better if all throttling efforts happen in one place, e.g. CMS. I'll think about it ...

        Michael McCandless added a comment -

        OK here's yet another approach that is promising, but this is still a
        rough work in progress...

        First, it removes all "cross cutting" merge abort checking and instead
        does it "down low" by wrapping all IndexOutputs created for merging:
        nice cleanup!

        Second, it puts all throttling (hard stall of incoming index threads,
        pause/unpause merges, IO rate limiting) responsibility in CMS, which
        makes sense because it's the merge scheduler that "knows" whether
        merges are falling behind, that sees how many merges need running,
        etc. (Plus CMS is already doing throttling...).

        Each merge is given its own MergeRateLimiter by IndexWriter, which
        handles 1) checking for abort, 2) pausing/unpausing merges, and 3)
        optionally applying an "io nice" (write MB/sec) rate limit to each merge thread.

        CMS has a simplistic estimator of required bytes/sec (it records the
        last 1K merges and computes the required bytes/sec) ... I think this
        is too simplistic (e.g. it doesn't handle a slow index that suddenly
        picks up) ... still thinking about how it can better tune itself.

        I also removed CMS tweaking thread priorities: this seems ineffective
        in practice, and I think the "io nice" approach is better.
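
        A sketch of the wrapping idea, assuming Lucene 5.x store APIs (the
        committed MergeRateLimiter differs in detail): every IndexOutput
        opened for a merge delegates through a wrapper that pauses,
        rate-limits, or aborts before bytes are written.

          import java.io.IOException;
          import org.apache.lucene.store.IndexOutput;
          import org.apache.lucene.store.RateLimiter;

          class RateLimitedMergeOutput extends IndexOutput {
            private final IndexOutput delegate;
            private final RateLimiter limiter;
            private long bytesSinceLastPause;

            RateLimitedMergeOutput(IndexOutput delegate, RateLimiter limiter) {
              super("rateLimited(" + delegate + ")");
              this.delegate = delegate;
              this.limiter = limiter;
            }

            private void maybePause(int newBytes) throws IOException {
              bytesSinceLastPause += newBytes;
              if (bytesSinceLastPause >= limiter.getMinPauseCheckBytes()) {
                // in the real code this is also where an aborted merge throws
                // MergeAbortedException, giving abort checking and "io nice" in one place
                limiter.pause(bytesSinceLastPause);
                bytesSinceLastPause = 0;
              }
            }

            @Override public void writeByte(byte b) throws IOException { maybePause(1); delegate.writeByte(b); }
            @Override public void writeBytes(byte[] b, int off, int len) throws IOException { maybePause(len); delegate.writeBytes(b, off, len); }
            @Override public long getFilePointer() { return delegate.getFilePointer(); }
            @Override public long getChecksum() throws IOException { return delegate.getChecksum(); }
            @Override public void close() throws IOException { delegate.close(); }
          }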

        Robert Muir added a comment -

        great to see progress nuking checkabort!

        Michael McCandless added a comment -

        New patch... I think it's close.

        This adds "enable/disableAutoIOThrottle" methods to CMS, to have CMS
        pick a reasonable IO throttle over time so merges don't fall behind
        but also don't suck up all available IO. It's a live setting, and the
        default is on. CMS.getIORateLimitMBPerSec returns the current
        auto-IO-rate.

        All "merge abort checks" are gone and instead handled by the per-merge
        rate limiter that IW sets up for each merge. This gives merge
        schedulers "io nice"-like control over each merge thread.

        Setting the right IO throttle is a fun control problem (see
        http://en.wikipedia.org/wiki/Control_theory), much like the fan in
        your laptop that changes its speed depending on internal temperature,
        or a factory that must add more workers depending on incoming jobs.

        I first tried "open loop" control, trying to set the rate based on
        indexing rate or incoming merges rate, but that doesn't work very
        well since there are many variables (e.g. CFS on or off) that affect
        required MB/sec writing.

        So then I switched to a simplistic feedback control: when a merge
        arrives, if another merge that's "close" to that same size is still
        running, we are falling behind and we aggressively (+20%) increase the
        IO throttle. Else, if there is a prior backlog still, leave the rate
        unchanged. Else, we decrease it. In my various tests of "tiny
        flushed segs" vs "big flushed segs", NRT reopens vs not, CFS or not,
        and 1, 2 or 3 merge threads, this approach seems to work well.

        I haven't yet tested on spinning disks though ... will have to wait
        until I'm back home ... somehow my beast box died while I'm on
        vacation! I think fsck must be waiting for me on the console.

        Forced merges have their own separate throttle (defaults to
        unlimited).

        I think it's important CMS not have min/max MB/sec throttle control:
        I think this just invites disaster when apps set them to inappropriate
        values (but I added a protected CMS method "escape hatch" so a
        subclass can override the control logic).

        I also removed RateLimitedDirectoryWrapper: it's too simplistic and
        too dangerous. Finally I cleaned a few things up and improved verbose
        infoStream logging so we can see more stats for each merge.
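
        The feedback rule above, as a hedged sketch (the +20% increase is
        stated above; the decrease factor here is an assumption, not the
        committed constant):

          double updateMergeRate(double currentMBPerSec,
                                 boolean similarSizedMergeStillRunning,
                                 boolean backlogRemains) {
            if (similarSizedMergeStillRunning) {
              // a merge "close" to the same size is still running: we are
              // falling behind, so aggressively raise the allowed rate by 20%
              return currentMBPerSec * 1.20;
            } else if (backlogRemains) {
              // still working off a prior backlog: hold the rate steady
              return currentMBPerSec;
            } else {
              // keeping up comfortably: back off to leave IO for searches
              return currentMBPerSec / 1.10;
            }
          }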

        Robert Muir added a comment -
            // Defensive: sleep for at most 250 msec; the loop above will call us again if we should keep sleeping:
            if (curPauseNS > 250L*1000000000) {
              curPauseNS = 250L*1000000000;
            }
        

        Did you mean 250 milliseconds or 250 seconds?

        Robert Muir added a comment -

        +1, I really like this approach.

        Michael McCandless added a comment -

        Did you mean 250 milliseconds or 250 seconds?

        Woops! I'll fix, thanks.
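
        For reference, 250 msec is 250 * 10^6 nanoseconds (not 250 * 10^9,
        which is 250 seconds), so the clamp presumably becomes:

            // Defensive: sleep for at most 250 msec; the loop above will call us again if we should keep sleeping:
            if (curPauseNS > 250L*1000*1000) {
              curPauseNS = 250L*1000*1000;
            }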

        Michael McCandless added a comment -

        New patch: fixes one nocommit, adds some more infoStream logging
        around applying deletes, and fixes "ant precommit". I also fixed CFS
        building so it is throttled too.

        I tested on spinning disks ... it seems to behave well: under intense
        indexing, the throttle moves to unlimited since the spinning disk
        can't keep up. Under light indexing, it stays low.

        I upped the starting rate to 20 MB/sec (from 5 before): this helps it
        move to less throttling more quickly before merges fall behind in the
        beginning during heavy indexing.

        Tests pass ... I think it's ready.

        Adrien Grand added a comment -

        I was never sure what a good value for the rate limiter would be so I'm very happy to see Lucene take care of it by itself.

        +  /** true if we should rate-limit writes for each merge; false if not.  null means use dynamic default: */
        +  private boolean doAutoIOThrottle = true;
        

        I think the comment is outdated since doAutoIOThrottle is a boolean now (instead of a Boolean)? There is a similar leftover a couple of lines below I think: if (doAutoIOThrottle == Boolean.TRUE)

        +    /** Set by {@link IndexWriter} to rate limit writes and abort this merge. */
        +    public final MergeRateLimiter rateLimiter;
        

        I think the comment is a bit confusing since this property is not actually set by the index writer?

        /** Returns 0 if no pause happened, 1 if pause because rate was 0.0 (merge is paused), 2 if paused with a normal rate limit. */
          private synchronized int maybePause(long bytes, long curNS) throws MergePolicy.MergeAbortedException
        

        Maybe having constants or an enum would make the code easier to read?
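
        A sketch of that last suggestion: replace the 0/1/2 return codes
        with a small enum so callers read intent instead of magic numbers.

          enum PauseResult {
            NO,      // no pause happened
            STOPPED, // paused because the rate was 0.0 (merge is paused)
            PAUSED   // paused under a normal rate limit
          }

          // maybePause would then be declared as:
          // private synchronized PauseResult maybePause(long bytes, long curNS)
          //     throws MergePolicy.MergeAbortedException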

        Michael McCandless added a comment -

        Thanks Adrien Grand, here's a new patch with those fixes.

        ASF subversion and git services added a comment -

        Commit 1649532 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1649532 ]

        LUCENE-6119: CMS dynamically rate limits IO writes of each merge depending on incoming merge rate

        ASF subversion and git services added a comment -

        Commit 1649539 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1649539 ]

        LUCENE-6119: CMS dynamically rate limits IO writes of each merge depending on incoming merge rate

        ASF subversion and git services added a comment -

        Commit 1650025 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1650025 ]

        LUCENE-6119: fix just arrived merge to throttle correctly

        ASF subversion and git services added a comment -

        Commit 1650026 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1650026 ]

        LUCENE-6119: fix just arrived merge to throttle correctly

        ASF subversion and git services added a comment -

        Commit 1650027 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1650027 ]

        LUCENE-6119: fix just arrived merge to throttle correctly

        ASF subversion and git services added a comment -

        Commit 1650463 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1650463 ]

        LUCENE-6119: set initial rate for forced merge correctly

        ASF subversion and git services added a comment -

        Commit 1650464 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1650464 ]

        LUCENE-6119: set initial rate for forced merge correctly

        ASF subversion and git services added a comment -

        Commit 1650594 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1650594 ]

        LUCENE-6119: must check merge for abort even when we are not rate limiting; don't wrap rate limiter when doing addIndexes (it's not abortable); don't leak file handle when wrapping

        ASF subversion and git services added a comment -

        Commit 1650595 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1650595 ]

        LUCENE-6119: must check merge for abort even when we are not rate limiting; don't wrap rate limiter when doing addIndexes (it's not abortable); don't leak file handle when wrapping

        ASF subversion and git services added a comment -

        Commit 1651305 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1651305 ]

        LUCENE-6119: make sure minPauseCheckBytes is set on init of MergeRateLimiter

        ASF subversion and git services added a comment -

        Commit 1651307 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1651307 ]

        LUCENE-6119: make sure minPauseCheckBytes is set on init of MergeRateLimiter

        Dawid Weiss added a comment -

        I realize this issue is closed but it'd be sweet to have this kind of adaptive heuristic to throttle the number of merging threads as well. Let me explain.

        When you think about it, the really important quality measure, on the surface, is the I/O throughput of merges combined with the I/O throughput of IW additions (indexing). Essentially we want to maximize a function:

        f = merge_throughput + indexing_throughput
        

        perhaps with a bias towards indexing_throughput, which can be modeled (by multiplying by a constant?). The underlying variables to adaptively tune are:

        • how many merge threads there are (for example having too many doesn't make sense on a spindle, with an SSD this is not a problem),
        • when to pause/ resume existing merge threads,
        • when to pause/ resume indexing threads.

        What's interesting is that we can tweak these variables in response to the current value (and gradient) of the function f. This means an adaptive algorithm could (for example):

        • react to temporary external system load (for example pausing some merge threads if it observes a drop in throughput),
        • find out the sweet spot of how many merge threads there can be without saturating I/O (no need to detect SSD vs. spindle; we just want to maximize f – the optimal number of merge threads would emerge by itself from looking at the data).

        Now the big question is what this algorithm should look like, of course. The options vary from relatively simple hand-written rule-based heuristics to an advanced black-box with either pre-trained or adaptive machine learning algorithms.

        I have an application that has just one of the objectives of function f (we need to quickly merge a large set of segments, ideally without knowing or caring what the underlying disk hardware/disk buffers are). I'll report my impressions once I have it done.
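
        As a hedged illustration of this proposal (every name below is
        hypothetical; nothing like it exists in Lucene), a simple
        hill-climber could adjust the merge thread count based on the sign
        of the change in f:

          class MergeThreadTuner {
            private static final double INDEXING_BIAS = 2.0; // assumed bias toward indexing throughput
            private int mergeThreads = 1;
            private int direction = 1;
            private double lastF = Double.NEGATIVE_INFINITY;

            /** Called periodically with measured throughputs; returns the new thread count. */
            int tune(double mergeMBPerSec, double indexingMBPerSec) {
              double f = mergeMBPerSec + INDEXING_BIAS * indexingMBPerSec;
              if (f < lastF) {
                direction = -direction; // the last adjustment hurt f: reverse course
              }
              lastF = f;
              mergeThreads = Math.max(1, mergeThreads + direction);
              return mergeThreads;
            }
          }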

        Michael McCandless added a comment -

        I think auto-tuning merge thread count would be a great addition!

        Dawid Weiss added a comment -

        I know. It would take a lot of manual tuning or detection (ssd vs. non-ssd vs. hybrid vs. large mem disk buffers, etc.) off the map. And it could gracefully play with other components of the system without clogging everything (like ionice). We'll see.

        Anshum Gupta added a comment -

        Bulk close after 5.0 release.


          People

          • Assignee: Michael McCandless
          • Reporter: Michael McCandless
          • Votes: 0
          • Watchers: 8
