Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: modules/benchmark
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      I was interested in measuring the performance of IndexWriter.addIndexes(Directory) vs. IndexWriter.addIndexes(IndexReader), so I wrote an AddIndexesTask and a matching .alg file. The task takes a parameter that selects whether to use the Directory or the IndexReader variant. I'll upload the patch and describe the perf results.

        Activity

        Shai Erera added a comment -

        Patch adds AddIndexesTask (+Test) and an addIndexes.alg.

        Because of how Benchmark works, the input directory can only be a location on the file system (and not e.g. RAMDirectory).

        I haven't included PayloadProcessorProvider yet; I think it can be added separately.

        Shai Erera added a comment -

        I ran a small benchmark over an index with 1M documents, which was generated using the following .alg file:

        writer.version=LUCENE_40
        ram.flush.mb=128
        analyzer=org.apache.lucene.analysis.core.WhitespaceAnalyzer
        directory=FSDirectory
        work.dir=input
        doc.stored=false
        doc.tokenized=true
        doc.term.vector=false
        log.step=20000
        content.source=org.apache.lucene.benchmark.byTask.feeds.SingleDocSource
        task.max.depth.log=2
        
        # ----------------------------------------------------------------------------
        ResetSystemErase
        CreateIndex
        [ { "AddDocs" AddDoc > : 125000 ] : 8
        CloseIndex
        RepSumByName
        

        Then I ran the following addIndexes.alg file:

        writer.version=LUCENE_40
        analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
        directory=FSDirectory
        work.dir=output
        task.max.depth.log=2
        
        # directory to add to the target index
        addindexes.input.dir=input/index
        
        # -----------------------------------------------------------------------
        
        # call addIndexes (Directory)
        ResetSystemErase
        CreateIndex
        { "AddIndexesDirectory" AddIndexes(true) >
        CloseIndex
        
        # call addIndexes (IndexReader)
        ResetSystemErase
        CreateIndex
        { "AddIndexesReader" AddIndexes(false) >
        CloseIndex
        
        RepSumByPref AddIndexes
        

        The run reports:

        ------------> Report Sum By Prefix (AddIndexes) (2 about 2 out of 7)
        Operation           round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
        AddIndexesDirectory     0        1            1         1.33        0.75     3,873,528      4,194,304
        AddIndexesReader        0        1            1         0.06       17.82     5,795,936      7,906,304
        

        To highlight: addIndexes(Directory) is about 23x faster than addIndexes(IndexReader) (0.75 s vs. 17.82 s elapsed), and that's on a fairly small and simple index (376 MB, with relatively few posting lists). On a more complex index with more posting lists, addIndexes(IndexReader) will do more CPU encoding/decoding work, while I suspect the raw file-system copies done by addIndexes(Directory) will not be affected much.

        This shows how important it is to use addIndexes(Dir) whenever possible ...
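
        As a quick sanity check on the numbers in the report, the speedup and the rec/s column can be recomputed from elapsedSec. This is only a small arithmetic sketch; the class and method names are made up for illustration and are not part of the patch or of the benchmark module.

```java
import java.util.Locale;

public class AddIndexesSpeedup {

    // Ratio of the slower elapsed time to the faster one.
    static double speedup(double slowSec, double fastSec) {
        return slowSec / fastSec;
    }

    // The rec/s column: records processed per elapsed second.
    static double recsPerSec(int recs, double elapsedSec) {
        return recs / elapsedSec;
    }

    public static void main(String[] args) {
        double directorySec = 0.75;  // AddIndexesDirectory elapsedSec from the report
        double readerSec = 17.82;    // AddIndexesReader elapsedSec from the report
        int recsPerRun = 1;          // each run is a single addIndexes call

        System.out.println(String.format(Locale.ROOT,
                "speedup: %.2fx", speedup(readerSec, directorySec)));
        System.out.println(String.format(Locale.ROOT,
                "AddIndexesDirectory rec/s: %.2f", recsPerSec(recsPerRun, directorySec)));
        System.out.println(String.format(Locale.ROOT,
                "AddIndexesReader rec/s: %.2f", recsPerSec(recsPerRun, readerSec)));
    }
}
```

        The ratio comes out a little under 24, consistent with the "x23" figure, and the recomputed rec/s values round to the 1.33 and 0.06 shown in the report.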

        Shai Erera added a comment -

        Committed revision 1335363.


          People

          • Assignee:
            Shai Erera
            Reporter:
            Shai Erera
          • Votes:
            0
            Watchers:
            0
