Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3, 2.3.1, 2.3.2, 2.4
    • Fix Version/s: 2.4
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I've been working on refactoring DocumentsWriter to make it more
      modular, so that adding new indexing functionality (like column-stride
      stored fields, LUCENE-1231) is just a matter of adding a plugin into
      the indexing chain.

      This is an initial step towards flexible indexing (but there is still
      a lot more to do!).

      And it's very much still a work in progress – there are intermittent
      thread safety issues, I need to add test cases and test/iterate on
      performance, there are many "nocommits", etc. This is a snapshot of my
      current state...

      The approach introduces "consumers" (abstract classes defining the
      interface) at different levels during indexing. E.g., DocConsumer
      consumes the whole document. DocFieldConsumer consumes separate
      fields, one at a time. InvertedDocConsumer consumes tokens produced
      by running each field through the analyzer. TermsHashConsumer writes
      its own bytes into in-memory posting lists stored in byte slices,
      indexed by term, etc.
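      As a rough sketch, the layering might look like this (the class names
      come from the description above, but the method signatures are
      simplified guesses, not the patch's actual API):

```java
// Sketch of the consumer layering; class names follow the issue
// description, but signatures are simplified guesses, not the patch's API.
abstract class DocFieldConsumer {
    // Consumes one field of a document at a time.
    abstract void processField(String name, String value);
}

abstract class DocConsumer {
    // Consumes a whole document.
    abstract void processDocument(String[] names, String[] values);
}

// A DocConsumer that forwards each field down the chain, mirroring how
// DocumentsWriter only ever talks to the top-level DocConsumer.
class ForwardingDocConsumer extends DocConsumer {
    final DocFieldConsumer fieldConsumer;
    int docCount;

    ForwardingDocConsumer(DocFieldConsumer fc) {
        fieldConsumer = fc;
    }

    void processDocument(String[] names, String[] values) {
        for (int i = 0; i < names.length; i++) {
            fieldConsumer.processField(names[i], values[i]);
        }
        docCount++;
    }
}
```

      The point is the indirection: the outer layer never knows what the
      consumer underneath actually does with each field.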

      DocumentsWriter*.java is then much simpler: it only interacts with a
      DocConsumer and has no idea what that consumer is doing. Under that
      DocConsumer there is a whole "indexing chain" that does the real work:

      • NormsWriter holds norms in memory and then flushes them to _X.nrm.
      • FreqProxTermsWriter holds postings data in memory and then flushes
        to _X.frq/prx.
      • StoredFieldsWriter flushes immediately to _X.fdx/fdt.
      • TermVectorsTermsWriter flushes immediately to _X.tvx/tvf/tvd.

      DocumentsWriter still manages things like flushing a segment, closing
      doc stores, buffering & applying deletes, freeing memory, aborting
      when necessary, etc.

      In this first step, everything is package-private, and the indexing
      chain is hardwired (instantiated in DocumentsWriter) to match the
      chain currently on Lucene trunk. Over time we can open this up.
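      Concretely, "hardwired" just means the stages are fixed when
      DocumentsWriter constructs the chain, rather than supplied by callers.
      A minimal, invented sketch (none of these names are from the patch):

```java
import java.util.List;
import java.util.function.Consumer;

// Invented sketch: a "hardwired" chain fixes its stages at construction
// time, the way DocumentsWriter instantiates the concrete writers itself
// instead of accepting them as plugins.
class HardwiredChain {
    private final List<Consumer<String>> stages;

    HardwiredChain(List<Consumer<String>> stages) {
        this.stages = stages; // fixed once; not pluggable afterwards
    }

    void consumeField(String field) {
        // Every stage sees every field, in a fixed order.
        for (Consumer<String> s : stages) {
            s.accept(field);
        }
    }
}
```

      Opening this up later would mean letting users supply their own list
      of stages instead of the built-in one.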

      There are no changes to the index file format.

      For the most part this is just a [large] refactoring, except for these
      two small actual changes:

      • Improved concurrency with mixed large/small docs: previously the
        thread state would be tied up when docs finished indexing
        out-of-order. Now, it's not: instead I use a separate class to
        hold any pending state to flush to the doc stores, and immediately
        free up the thread state to index other docs.
      • Buffered norms in memory now remain sparse, until flushed to the
        _X.nrm file. Previously we would "fill holes" in norms in memory
        as we went, which could easily use far too much memory. This isn't
        really a solution to the problem of sparse norms (LUCENE-830); it
        just keeps that issue from causing a memory blowup during indexing;
        memory use can still blow up during searching.
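      To illustrate the second change, here is a hedged sketch (illustrative
      only, not the patch's code) of buffering norms sparsely and filling
      the holes only at flush time:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the sparse-norms idea (illustrative, not the patch's code):
// buffer (docID, norm) pairs as docs arrive, and densify (filling the
// holes with a default norm) only when flushing to the _X.nrm file.
class SparseNormsBuffer {
    private final List<int[]> entries = new ArrayList<>(); // {docID, norm}

    void setNorm(int docID, byte norm) {
        // O(1) per doc; memory is proportional to docs that have norms,
        // not to the highest docID seen so far.
        entries.add(new int[] { docID, norm });
    }

    // Densify only at flush time; defaultNorm fills the holes.
    byte[] flush(int numDocs, byte defaultNorm) {
        byte[] norms = new byte[numDocs];
        Arrays.fill(norms, defaultNorm);
        for (int[] e : entries) {
            norms[e[0]] = (byte) e[1];
        }
        return norms;
    }
}
```

      The dense array still gets built eventually, but only once per flush
      instead of being maintained continuously during indexing.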

      I expect performance (indexing throughput) will be worse with this
      change. I'll profile & iterate to minimize this, but I think we can
      accept some loss. I also plan to measure the benefit of manually
      recycling RawPostingList instances from our own pool, vs letting GC
      recycle them.
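      The recycling idea is roughly a free list, sketched below (a generic
      illustration; an int[] stands in for RawPostingList here):

```java
import java.util.ArrayDeque;

// Minimal free-list sketch of recycling posting buffers ourselves rather
// than letting GC reclaim them; int[] stands in for RawPostingList.
class PostingPool {
    private final ArrayDeque<int[]> free = new ArrayDeque<>();

    int[] get(int minSize) {
        int[] p = free.poll();
        // Reuse a pooled buffer if it is big enough; otherwise allocate
        // a fresh one (an undersized pooled buffer is simply dropped).
        return (p != null && p.length >= minSize) ? p : new int[minSize];
    }

    void recycle(int[] p) {
        free.push(p); // keep the instance alive for the next get()
    }
}
```

      The trade-off to measure is allocation/GC pressure saved versus the
      bookkeeping cost and the memory the pool keeps pinned.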

      1. LUCENE-1301.patch
        338 kB
        Michael McCandless
      2. LUCENE-1301.take3.patch
        322 kB
        Michael McCandless
      3. LUCENE-1301.take2.patch
        308 kB
        Michael McCandless
      4. LUCENE-1301.patch
        295 kB
        Michael McCandless

        Activity

        Michael Busch added a comment -

        Mike, I think the ArrayUtil class is missing in your patch?

        Michael McCandless added a comment -

        Woops, sorry, I forgot to svn add that. I'm attaching my current
        state, with that file added. Does this one work? (You may need to
        forcefully remove DocumentsWriterFieldData.java if applying the patch
        doesn't do so).

        Michael Busch added a comment -

        Just a quick update, Mike:
        With your latest patch it's compiling fine now. Thanks!
        I'm seeing NullPointerExceptions in TestStressIndexing2 though,
        but I guess this patch is not final yet.

        I haven't read the patch yet, hope I'll find some time soon.

        Michael McCandless added a comment -

        Attached new rev of the patch.

        I'm seeing NullPointerExceptions in TestStressIndexing2 though,

        I believe this patch fixes that. All tests should now pass.

        Michael McCandless added a comment -

        New rev of the patch attached. I've fixed all nocommits. All tests
        pass. I believe this version is ready to commit!

        I'll wait a few more days before committing...

        I ran some indexing throughput tests, indexing Wikipedia docs from a
        line file using StandardAnalyzer. Each result is best of 4. Here's
        the alg:

        analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
        
        doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
        
        docs.file=/Volumes/External/lucene/wiki.txt
        doc.stored = true
        doc.term.vector = true
        doc.add.log.step=2000
        
        directory=FSDirectory
        autocommit=false
        compound=false
        
        work.dir=/lucene/work
        ram.flush.mb=64
        
        { "Rounds"
          ResetSystemErase
          { "BuildIndex"
            - CreateIndex
             { "AddDocs" AddDoc > : 200000
            - CloseIndex
          }
          NewRound
        } : 4
        
        RepSumByPrefRound BuildIndex
        

        Gives these results with term vectors & stored fields:

        patch
          BuildIndex      1        1       200000        900.4      222.12   410,938,688  1,029,046,272

        trunk
          BuildIndex      1        1       200000        969.0      206.39   400,372,256  1,029,046,272

        2.3
          BuildIndex      2        1       200002        905.4      220.89   391,630,016  1,029,046,272


        And without term vectors & stored fields:

        patch
          BuildIndex      3        1       200000      1,297.5      154.15   399,966,592  1,029,046,272

        trunk
          BuildIndex      1        1       200000      1,372.5      145.72   390,581,376  1,029,046,272

        2.3
          BuildIndex      1        1       200002      1,308.5      152.85   389,224,640  1,029,046,272


        So, the bad news is the refactoring has made things a bit (~5-7%)
        slower than the current trunk. But the good news is trunk was already
        ~6-7% faster than 2.3, so the two nearly cancel out.

        If I repeat these tests using tiny docs (~100 bytes per body) instead,
        indexing the first 10 million docs, the slowdown is worse (~13-15% vs
        trunk, ~11-13% vs 2.3)... I think it's because the additional method calls
        with the refactoring become a bigger part of the time.

        With term vectors & stored fields:

        patch
          BuildIndex      3        1     10000000     38,320.1      260.96   313,980,832  1,029,046,272

        trunk
          BuildIndex      2        1     10000000     45,194.1      221.27   414,987,072  1,029,046,272

        2.3
          BuildIndex      1        1     10000002     42,861.4      233.31   182,957,440  1,029,046,272


        Without term vectors & stored fields:

        patch
          BuildIndex      1        1     10000000     60,778.4      164.53   341,611,456  1,029,046,272

        trunk
          BuildIndex      2        1     10000000     68,387.8      146.23   405,388,960  1,029,046,272

        2.3
          BuildIndex      0        1     10000002     68,052.7      146.95   330,334,912  1,029,046,272
        

        I think these small slowdowns are worth the improvement in code
        clarity.


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0
            Watchers:
            0
