Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-1120

Use bulk-byte-copy when merging term vectors

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Indexing all of Wikipedia, with term vectors on, under the YourKit
      profiler, shows that 26% of the time (!!) was spent merging the
      vectors. This was without offsets & positions, which would make
      matters even worse.

      Depressingly, merging, even with ConcurrentMergeScheduler, cannot in
      fact keep up with the flushing of new segments in this test, and this
      is on a strong IO system (Mac Pro with 4 drive RAID 0 array, 4 CPU
      cores).

      So, just like Robert's idea to merge stored fields with bulk copying
      whenever the field name->number mapping is "congruent" (LUCENE-1043),
      we can do the same with term vectors.

      It's a little trickier because the term vectors format doesn't quite
      make it easy to bulk-copy because it doesn't directly encode the
      offset into the tvf file.

      I worked out a patch that changes the tvx format slightly, by storing
      the absolute position in the tvf file for the start of each document
      into the tvx file, just like it does for tvd now. This adds an extra
      8 bytes (long) in the tvx file, per document.

      Then, I removed a vLong (the first "position" stored inside the tvd
      file), which makes tvd contents fully position independent (so you can
      just copy the bytes).

      This adds up to 7 bytes per document (less for larger indices) that
      have term vectors enabled, but I think this small increase in index
      size is acceptable for the gains in indexing performance?

      With this change, the time spent merging term vectors dropped from 26%
      to 3%. Of course, this only applies if your documents are "regular".
      I think in the future we could have Lucene try hard to assign the same
      field number for a given field name, if it had been seen before in the
      index...

      Merging terms now dominates the merge cost (~20% over overall time
      building the Wikipedia index).

      I also beefed up TestBackwardsCompatibility unit test: test a non-CFS
      and a CFS of versions 1.9, 2.0, 2.1, 2.2 index formats, and added some
      term vector fields to these indices.

      1. LUCENE-1120.patch
        32 kB
        Michael McCandless
      2. LUCENE-1120.take2.patch
        34 kB
        Michael McCandless

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        Attached patch. All tests pass.

        (Note that the TestBackwardsCompatibility test will fail if you apply the patch because the new *.zip files I added aren't in the patch).

        I think we should commit this for 2.3? It's a sizable gain in merging
        performance.

        Show
        mikemccand Michael McCandless added a comment - Attached patch. All tests pass. (Note that the TestBackwardsCompatibility test will fail if you apply the patch because the new *.zip files I added aren't in the patch). I think we should commit this for 2.3? It's a sizable gain in merging performance.
        Hide
        michaelbusch Michael Busch added a comment -

        I think we should commit this for 2.3? It's a sizable gain in merging

        HI Mike,

        I'm planning to branch the trunk today. Considering the file format changes, it might be a bit risky to apply this patch last minute. I think we should commit this for 2.4. What do you think?

        -Michael

        Show
        michaelbusch Michael Busch added a comment - I think we should commit this for 2.3? It's a sizable gain in merging HI Mike, I'm planning to branch the trunk today. Considering the file format changes, it might be a bit risky to apply this patch last minute. I think we should commit this for 2.4. What do you think? -Michael
        Hide
        mikemccand Michael McCandless added a comment -

        Considering the file format changes, it might be a bit risky to apply this patch last minute. I think we should commit this for 2.4. What do you think?

        OK, I agree, it is somewhat risky, so let's wait. (Though it is a sizable gain in performance!).

        Show
        mikemccand Michael McCandless added a comment - Considering the file format changes, it might be a bit risky to apply this patch last minute. I think we should commit this for 2.4. What do you think? OK, I agree, it is somewhat risky, so let's wait. (Though it is a sizable gain in performance!).
        Hide
        michaelbusch Michael Busch added a comment -

        Indexing all of Wikipedia, with term vectors on, under the YourKit
        profiler, shows that 26% of the time (!!) was spent merging the
        vectors.

        I wonder how accurate these profiling numbers are? Java profiling slows
        down (non-native) method calls, but not I/O operations. Did you measure
        the merge time before and after applying this patch without a profiling
        tool?

        -Michael

        Show
        michaelbusch Michael Busch added a comment - Indexing all of Wikipedia, with term vectors on, under the YourKit profiler, shows that 26% of the time (!!) was spent merging the vectors. I wonder how accurate these profiling numbers are? Java profiling slows down (non-native) method calls, but not I/O operations. Did you measure the merge time before and after applying this patch without a profiling tool? -Michael
        Hide
        mikemccand Michael McCandless added a comment -

        I wonder how accurate these profiling numbers are? Java profiling slows
        down (non-native) method calls, but not I/O operations. Did you measure
        the merge time before and after applying this patch without a profiling
        tool?

        Good point – I haven't measured outside of profiling. I plan to build a full Wiki index with and without this change to test....

        Show
        mikemccand Michael McCandless added a comment - I wonder how accurate these profiling numbers are? Java profiling slows down (non-native) method calls, but not I/O operations. Did you measure the merge time before and after applying this patch without a profiling tool? Good point – I haven't measured outside of profiling. I plan to build a full Wiki index with and without this change to test....
        Hide
        mikemccand Michael McCandless added a comment -

        Attached patch updated to current trunk. All tests pass. I plan to
        commit after 2.3 is out...

        OK I ran a performance test with this patch, indexing the first 200K
        docs of Wikipedia using this alg:

        analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
        doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker

        doc.stored = true
        doc.term.vector = true
        doc.term.vector.positions = true
        doc.term.vector.offsets = true

        docs.file=/Volumes/External/lucene/wiki.txt

        directory=FSDirectory

        merge.scheduler=org.apache.lucene.index.SerialMergeScheduler

        { "Rounds"
        ResetSystemErase
        { "BuildIndex"
        CreateIndex

        { "AddDocs" AddDoc > : 200000 CloseIndex }

        NewRound
        } : 3

        RepSumByPrefRound BuildIndex

        I used SerialMergeScheduler so that I could measure time saved due to
        faster merging.

        Without the patch, best of 3 was 509.0 sec; with patch, best of 3 was
        448.8 sec = 11.8% overall speedup.

        Show
        mikemccand Michael McCandless added a comment - Attached patch updated to current trunk. All tests pass. I plan to commit after 2.3 is out... OK I ran a performance test with this patch, indexing the first 200K docs of Wikipedia using this alg: analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker doc.stored = true doc.term.vector = true doc.term.vector.positions = true doc.term.vector.offsets = true docs.file=/Volumes/External/lucene/wiki.txt directory=FSDirectory merge.scheduler=org.apache.lucene.index.SerialMergeScheduler { "Rounds" ResetSystemErase { "BuildIndex" CreateIndex { "AddDocs" AddDoc > : 200000 CloseIndex } NewRound } : 3 RepSumByPrefRound BuildIndex I used SerialMergeScheduler so that I could measure time saved due to faster merging. Without the patch, best of 3 was 509.0 sec; with patch, best of 3 was 448.8 sec = 11.8% overall speedup.
        Hide
        mikemccand Michael McCandless added a comment -

        Attaching the right patch this time...

        Show
        mikemccand Michael McCandless added a comment - Attaching the right patch this time...
        Hide
        mikemccand Michael McCandless added a comment -

        I just committed this. Note that this is a [small] change to the index format, so if you use trunk to build an index, 2.3 won't be able to read it!

        Show
        mikemccand Michael McCandless added a comment - I just committed this. Note that this is a [small] change to the index format, so if you use trunk to build an index, 2.3 won't be able to read it!

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development