Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3
    • Component/s: core/index
    • Labels:
      None

      Description

      Provide the ability to handle merges in one or more concurrent threads, i.e., concurrent with other IndexWriter operations.

      I'm factoring the code from LUCENE-847 for this.

      1. LUCENE-870.take3.patch
        17 kB
        Michael McCandless
      2. LUCENE-870.take2.patch
        12 kB
        Michael McCandless
      3. CMP.patch.txt
        131 kB
        Steven Parkes
      4. concurrentMerge.patch
        104 kB
        Steven Parkes

        Issue Links

          Activity

          Steven Parkes created issue -
          Steven Parkes made changes -
          Field Original Value New Value
          Link This issue depends on LUCENE-848 [ LUCENE-848 ]
          Steven Parkes made changes -
          Link This issue depends on LUCENE-847 [ LUCENE-847 ]
          Steven Parkes added a comment -

          Sigh. My typo rate's been too high lately. The depends-on link is to LUCENE-847, not LUCENE-848. Perhaps someone with JIRA "manage links" permissions can delete the wrong one.

          Yonik Seeley made changes -
          Link This issue depends on LUCENE-848 [ LUCENE-848 ]
          Steven Parkes added a comment -

          Copying Ning's concurrency patch over here, since LUCENE-847 is supposed to be the non-concurrent changes.

          Steven Parkes made changes -
          Attachment concurrentMerge.patch [ 12363278 ]
          Steven Parkes added a comment -

          Mike expressed interest in pursuing this with an alternative strategy, so I thought I'd give a "work in progress" snapshot of the way I'd be going.

          This code doesn't work, but it has some ideas, so it's only of interest to people who really want to make suggestions on how to do the parallelization.

          Overall, the idea of generating new threads for non-conflicting primitive merges seems okay. Need to make sure you don't overload the I/O system, and that throttling code isn't in there.

          A couple of things off the top that I haven't worked through yet:

          My current thinking is that when you are not going to do a merge serially, you need to copy the SegmentInfo objects that you will be using. It may be possible to do this with a lock, but that gets hairy. There's also state in the SegmentInfo objects themselves, like docStoreIsCompoundFile, that can get changed on the fly.

          flushDocStore is challenging to parallelize. It's synchronized now, but you probably would rather it not be? It's complicated by the fact that doc stores are shared by multiple segments, so non-conflicting merges may still share doc stores.
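
One way to read the "copy the SegmentInfo objects" idea above is as a defensive snapshot: hold a lock only long enough to copy the mutable per-segment state, then let the merge thread work from the copies. The sketch below is only an illustration of that reading. SegmentState and SegmentRegistry are hypothetical stand-ins, not Lucene's SegmentInfo or anything from the attached patches, and the sketch does not address the shared doc store problem.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for mutable per-segment state; NOT Lucene's SegmentInfo. */
class SegmentState {
    final String name;
    boolean docStoreIsCompoundFile; // may be changed by another thread via the registry

    SegmentState(String name, boolean docStoreIsCompoundFile) {
        this.name = name;
        this.docStoreIsCompoundFile = docStoreIsCompoundFile;
    }

    /** Returns an independent copy of the current field values. */
    SegmentState copy() {
        return new SegmentState(name, docStoreIsCompoundFile);
    }
}

/** Hypothetical owner of the live segment list; all access goes through its monitor. */
class SegmentRegistry {
    private final List<SegmentState> live = new ArrayList<>();

    synchronized void add(SegmentState s) {
        live.add(s);
    }

    synchronized void setCompound(int index, boolean value) {
        live.get(index).docStoreIsCompoundFile = value;
    }

    /** Copies segments [start, end) under the lock so a merge thread sees stable values. */
    synchronized List<SegmentState> snapshot(int start, int end) {
        List<SegmentState> out = new ArrayList<>(end - start);
        for (SegmentState s : live.subList(start, end)) {
            out.add(s.copy());
        }
        return out;
    }
}
```

A merge thread would call snapshot once before starting and never touch the live list again until it is ready to commit its result. The lock is held only for the copy, not for the duration of the merge.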

          Steven Parkes made changes -
          Attachment CMP.patch.txt [ 12364185 ]
          Michael McCandless added a comment -

          Attaching patch that provides ConcurrentMergePolicyWrapper using the
          "stateless API" approach for MergePolicy. This must be used with the
          patch I just attached to LUCENE-847.

          This wrapper can wrap any MergePolicy instance and schedule the
          requested merges using background threads, which frees IndexWriter
          threads to continue adding/deleting docs.

          CMPW (ConcurrentMergePolicyWrapper) accepts a "max thread count"
          limit: if the number of concurrent merges needed exceeds this, it
          returns the overflow back to IndexWriter, which runs those merges
          in the foreground.
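
The "max thread count" behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in the attached patch: OverflowingMergeRunner, submit and awaitAll are hypothetical names, a merge is modeled as a plain Runnable, and the overflow case is modeled by running the merge in the calling thread rather than by handing it back to IndexWriter. It also uses lambdas, so it assumes a newer JDK than the 1.5 used for the benchmark.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of bounded background merging with foreground overflow. */
class OverflowingMergeRunner {
    private final Object lock = new Object();
    private final int maxThreadCount;
    private final List<Thread> started = new ArrayList<>();
    private int running = 0; // number of background merges in flight; guarded by lock

    OverflowingMergeRunner(int maxThreadCount) {
        this.maxThreadCount = maxThreadCount;
    }

    /**
     * Runs the merge on a new background thread if a slot is free and returns true.
     * Otherwise runs it in the calling (foreground) thread and returns false.
     */
    boolean submit(Runnable merge) {
        synchronized (lock) {
            if (running < maxThreadCount) {
                running++;
                Thread t = new Thread(() -> {
                    try {
                        merge.run();
                    } finally {
                        synchronized (lock) {
                            running--; // free the slot even if the merge throws
                        }
                    }
                });
                started.add(t);
                t.start();
                return true;
            }
        }
        merge.run(); // overflow: the caller does the work itself
        return false;
    }

    /** Waits for every background merge started so far to finish. */
    void awaitAll() {
        List<Thread> copy;
        synchronized (lock) {
            copy = new ArrayList<>(started);
        }
        for (Thread t : copy) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

With a limit of 2, the first two overlapping merges go to background threads and a third one submitted while both are still running executes in the caller, which is the back-pressure effect the comment describes.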

          Also in the patch I added 2 test cases to the existing
          TestStressIndexing test to use ConcurrentMergePolicyWrapper.

          I ran a quick test using this alg:

          analyzer=org.apache.lucene.analysis.SimpleAnalyzer
          doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
          docs.file=/lucene/wikifull.txt
          directory=FSDirectory
          ram.flush.mb = 16
          max.field.length = 2147483647
          doc.add.log.step = 5000
          doc.maker.forever=false

          ResetSystemErase
          CreateIndex
          {AddDoc >: *
          CloseIndex

          RepSumByName

          For baseline I used "LogByteSizeMergePolicy". Then, I compared with
          the same merge policy, but wrapped using ConcurrentMergePolicyWrapper.

          Baseline took 1544 sec to index all of Wikipedia; using
          ConcurrentMergePolicyWrapper it took 1155 sec (about 25% less time),
          which is quite sizable. This is a powerful way to make use of
          concurrency without the complexity of having to add threads to your
          indexing process. (This is with JDK 1.5, on a quad-core Mac Pro with
          4 drives in a RAID 0 array.)

          Michael McCandless made changes -
          Attachment LUCENE-870.take2.patch [ 12364528 ]
          Michael McCandless made changes -
          Fix Version/s 2.3 [ 12312531 ]
          Michael McCandless added a comment -

          New rev of this patch to match newest patch added on LUCENE-847.

          Michael McCandless made changes -
          Attachment LUCENE-870.take3.patch [ 12364639 ]
          Michael McCandless committed 576798 (27 files)
          Reviews: none

          LUCENE-845, LUCENE-847, LUCENE-870: factor MergePolicy & MergeScheduler out of IndexWriter, improve overall concurrency of IndexWriter, and add ConcurrentMergeScheduler

          Lucene trunk
          Michael McCandless made changes -
          Resolution Fixed [ 1 ]
          Status Open [ 1 ] Resolved [ 5 ]
          Michael Busch made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Mark Thomas made changes -
          Workflow jira [ 12402847 ] Default workflow, editable Closed status [ 12562552 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12562552 ] jira [ 12583515 ]
          Gavin made changes -
          Link This issue depends on LUCENE-847 [ LUCENE-847 ]
          Gavin made changes -
          Link This issue depends upon LUCENE-847 [ LUCENE-847 ]

            People

            • Assignee:
              Steven Parkes
              Reporter:
              Steven Parkes
            • Votes:
              1
              Watchers:
              1
