Lucene - Core / LUCENE-4661

Reduce default maxMerge/ThreadCount for ConcurrentMergeScheduler

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      I think our current defaults (maxThreadCount=#cores/2,
      maxMergeCount=maxThreadCount+2) are too high ... I've frequently found
      merges falling behind and then slowing each other down when I index on
      a spinning-magnets drive.

      As a test, I indexed all of English Wikipedia with term-vectors (=
      heavy on merging), using 6 threads ... at the defaults
      (maxThreadCount=3, maxMergeCount=5, for my machine) it took 5288 sec
      to index & wait for merges & commit. When I changed to
      maxThreadCount=1, maxMergeCount=2, indexing time sped up to 2902
      seconds (45% faster). This is on a spinning-magnets disk... basically
      spinning-magnets disks don't handle the concurrent IO well.

      Then I tested an OCZ Vertex 3 SSD: at the current defaults it took
      1494 seconds and at maxThreadCount=1, maxMergeCount=2 it took 1795 sec
      (20% slower). Net/net the SSD can handle merge concurrency just fine.

      I think we should change the defaults: spinning magnet drives are hurt
      by the current defaults more than SSDs are helped ... apps that know
      their IO system is fast can always increase the merge concurrency.
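
      As a rough illustration of the numbers above, the sketch below
      computes the old defaults from a core count and the relative change
      in indexing time. It is not Lucene source code: the class name
      OldCmsDefaults, the integer division, and the floor of 1 thread are
      assumptions made for the example; only the two formulas and the four
      timings come from the description.

```java
// Illustrative sketch only; not taken from ConcurrentMergeScheduler.
final class OldCmsDefaults {

    /** Old default per the description: #cores / 2 (floor of 1 is an assumption). */
    static int maxThreadCount(int cores) {
        return Math.max(1, cores / 2);
    }

    /** Old default per the description: maxThreadCount + 2. */
    static int maxMergeCount(int cores) {
        return maxThreadCount(cores) + 2;
    }

    /** Percentage change in elapsed time from {@code before} to {@code after} seconds. */
    static double percentChange(double before, double after) {
        return 100.0 * (after - before) / before;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("old maxThreadCount = " + maxThreadCount(cores));
        System.out.println("old maxMergeCount  = " + maxMergeCount(cores));
        // Timings reported in the description, in seconds.
        System.out.printf("spinning disk: %.1f%%%n", percentChange(5288, 2902));
        System.out.printf("SSD:           %.1f%%%n", percentChange(1494, 1795));
    }
}
```

      A negative percentage means the run with maxThreadCount=1,
      maxMergeCount=2 took less time than the run at the old defaults.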

        Activity

        Commit Tag Bot added a comment -

        [trunk commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1429616

        LUCENE-4661: lower default maxThread/MergeCount in CMS

        Commit Tag Bot added a comment -

        [branch_4x commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1429617

        LUCENE-4661: lower default maxThread/MergeCount in CMS

        Littlestar added a comment -

        When you change the default to 1, it can't be changed any more, because setMaxMergeCount and setMaxThreadCount depend on each other.

        public void setMaxMergeCount(int count) {
          if (count < 1) {
            throw new IllegalArgumentException("count should be at least 1");
          }
          if (count < maxThreadCount) {
            throw new IllegalArgumentException("count should be >= maxThreadCount (= " + maxThreadCount + ")");
          }
          maxMergeCount = count;
        }

        public void setMaxThreadCount(int count) {
          if (count < 1) {
            throw new IllegalArgumentException("count should be at least 1");
          }
          if (count > maxMergeCount) {
            throw new IllegalArgumentException("count should be <= maxMergeCount (= " + maxMergeCount + ")");
          }
          maxThreadCount = count;
        }

        Michael McCandless added a comment -

        You can change the values, just be sure to first increase maxMergeCount and then maxThreadCount, in that order.
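
        The ordering constraint can be sketched with a small stand-in class. MergeLimits below is hypothetical and only copies the two checks quoted earlier in this thread; it is not the real ConcurrentMergeScheduler, and the starting values of 1 and 2 are assumed from the settings tested in the description.

```java
// Hypothetical stand-in that mirrors the two setters quoted in this thread.
final class MergeLimits {
    private int maxThreadCount = 1; // assumed starting value
    private int maxMergeCount = 2;  // assumed starting value

    void setMaxMergeCount(int count) {
        if (count < 1) {
            throw new IllegalArgumentException("count should be at least 1");
        }
        if (count < maxThreadCount) {
            throw new IllegalArgumentException("count should be >= maxThreadCount (= " + maxThreadCount + ")");
        }
        maxMergeCount = count;
    }

    void setMaxThreadCount(int count) {
        if (count < 1) {
            throw new IllegalArgumentException("count should be at least 1");
        }
        if (count > maxMergeCount) {
            throw new IllegalArgumentException("count should be <= maxMergeCount (= " + maxMergeCount + ")");
        }
        maxThreadCount = count;
    }

    int getMaxMergeCount() { return maxMergeCount; }

    int getMaxThreadCount() { return maxThreadCount; }

    public static void main(String[] args) {
        MergeLimits limits = new MergeLimits();
        // Raising: widen maxMergeCount first so the new thread count fits under it.
        limits.setMaxMergeCount(5);
        limits.setMaxThreadCount(3);
        // Lowering: shrink maxThreadCount first so it never exceeds maxMergeCount.
        limits.setMaxThreadCount(1);
        limits.setMaxMergeCount(2);
        System.out.println(limits.getMaxMergeCount() + "/" + limits.getMaxThreadCount());
    }
}
```

        Calling setMaxThreadCount(3) on a fresh instance fails because 3 exceeds the current maxMergeCount of 2, which is the situation described in the previous comment.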

        Uwe Schindler added a comment -

        Is there maybe the possibility to find out if a disk is an SSD or rotating? With some IOCTLs in C you can do this, but from Java?

        Shawn Heisey added a comment -

        I have a question about this - both for myself and for a message on the solr-user mailing list today.

        If you are importing millions of records from MySQL (or another DB) with DIH, eventually you'll reach a point where you've got multiple merge levels happening at the same time, which will stop indexing of new data long enough that the JDBC connection to the DB will time out.

        Is it enough in that situation to increase maxMergeCount, or do you also have to increase maxThreadCount? I have changed both, but if I only need to increase maxMergeCount and thus get the benefit of this issue, that would be awesome:

          <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
            <int name="maxThreadCount">4</int>
            <int name="maxMergeCount">4</int>
          </mergeScheduler>
        
        Michael McCandless added a comment -

        You should only increase maxMergeCount (unless you're on an SSD, in which case you should increase maxThreadCount too!); this will allow a larger backlog of merges while still only running one merge thread at a time.

        But note that this means "close()" can take a super long time while it works through the backlog of merges.

        Markus Jelsma added a comment -

        We're on SSDs but have two CPU cores in each box. According to the old default this would lead to:
        <int name="maxMergeCount">3</int>
        <int name="maxThreadCount">1</int>

        Would you suggest increasing them to the following?
        <int name="maxMergeCount">4</int>
        <int name="maxThreadCount">2</int>

        Shawn Heisey added a comment -

        "You should only increase maxMergeCount (unless you're on an SSD, in which case you should increase maxThreadCount too!); this will allow a larger backlog of merges while still only running one merge thread at a time."

        A clarification question - Will it always keep running the 'index new content' thread while it merges with one thread in the background as long as the total of background merges plus one doesn't exceed maxMergeCount? That's the crux of the problem. If you aren't sure without a test, that's OK - I will be testing later today, because I'd really like the benefit you have described from decreasing to one thread.

        Michael McCandless added a comment -

        Markus, I would stick with 3/1 ... but best would be to run experiments and see.

        Shawn, CMS will accept up to maxMergeCount merges, but then if another merge wants to kick off, CMS will pause the thread that "caused" this merge to be kicked off (i.e., pause the producers of segments). So if maxMergeCount=4, then 4 merges will be queued up (with one of them actually running, if maxThreadCount=1), but if your indexing thread(s) produce so many segments that a 5th merge now wants to run, they will then be paused at that point, until 1 merge finishes and we are back to 4 queued merges.
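
        The behaviour described above can be pictured with a toy counter model. MergeBacklogModel and its method names are hypothetical; the sketch only tracks counts as described in this comment and does not reproduce how ConcurrentMergeScheduler actually schedules or pauses threads.

```java
// Toy model of the backlog rule described above; not the real scheduler.
final class MergeBacklogModel {
    private final int maxMergeCount;
    private final int maxThreadCount;
    private int pending; // merges accepted and not yet finished (running + waiting)

    MergeBacklogModel(int maxMergeCount, int maxThreadCount) {
        if (maxThreadCount < 1 || maxMergeCount < maxThreadCount) {
            throw new IllegalArgumentException("need 1 <= maxThreadCount <= maxMergeCount");
        }
        this.maxMergeCount = maxMergeCount;
        this.maxThreadCount = maxThreadCount;
    }

    /** Returns false when the indexing thread that triggered the merge would be paused. */
    boolean offerMerge() {
        if (pending >= maxMergeCount) {
            return false;
        }
        pending++;
        return true;
    }

    /** Marks one accepted merge as finished, freeing a slot in the backlog. */
    void finishMerge() {
        if (pending == 0) {
            throw new IllegalStateException("no merge in flight");
        }
        pending--;
    }

    /** Merges actually executing, capped by maxThreadCount. */
    int running() { return Math.min(pending, maxThreadCount); }

    /** Merges accepted but waiting for a merge thread. */
    int waiting() { return pending - running(); }

    public static void main(String[] args) {
        MergeBacklogModel model = new MergeBacklogModel(4, 1);
        for (int i = 1; i <= 5; i++) {
            System.out.println("merge " + i + " accepted: " + model.offerMerge());
        }
        System.out.println("running=" + model.running() + " waiting=" + model.waiting());
    }
}
```

        With maxMergeCount=4 and maxThreadCount=1, the model accepts four merges, runs one, and refuses the fifth until finishMerge() is called.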

        Markus Jelsma added a comment -

        Thanks. Hope to do some experiments. Will report back if I can finish it up.

        wolfgang hoschek added a comment -

        Might be good to experiment with Linux block device read-ahead settings (/sbin/blockdev --setra) and make sure you use a file system that does write-behind (e.g. ext4 or xfs). Larger buffer sizes typically allow for more concurrent sequential streams even on spindles.


          People

          • Assignee: Michael McCandless
          • Reporter: Michael McCandless
          • Votes: 0
          • Watchers: 5
