Lucene - Core
LUCENE-843

improve how IndexWriter uses RAM to buffer added documents

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2
    • Fix Version/s: 2.3
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      I'm working on a new class (MultiDocumentWriter) that writes more than
      one document directly into a single Lucene segment, more efficiently
      than the current approach.

      This only affects the creation of an initial segment from added
      documents. I haven't changed anything after that, eg how segments are
      merged.

      The basic ideas are:

      • Write stored fields and term vectors directly to disk (don't
        use up RAM for these).
      • Gather posting lists & term infos in RAM, but periodically do
        in-RAM merges. Once RAM is full, flush buffers to disk (and
        merge them later when it's time to make a real segment).
      • Recycle objects/buffers to reduce time/stress in GC.
      • Other various optimizations.

      Some of these changes are similar to how KinoSearch builds a segment.
      But I haven't made any changes to Lucene's file format, nor added a
      requirement for a global fields schema.

      So far the only externally visible change is a new method
      "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
      deprecated) so that it flushes according to RAM usage and not a fixed
      number of documents added.
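
      As a rough illustration, this is how an application might switch to
      flushing by RAM usage (a minimal sketch, not code from the patch; the
      issue text calls the method "setRAMBufferSize", while the setter that
      eventually shipped is setRAMBufferSizeMB, so adjust the name to match
      your Lucene version):

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.index.IndexWriter;

        public class FlushByRamSketch {
          public static void main(String[] args) throws Exception {
            IndexWriter writer =
                new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
            writer.setRAMBufferSizeMB(32.0);    // flush whenever buffered docs use ~32 MB
            // writer.setMaxBufferedDocs(1000); // the deprecated flush-by-doc-count knob
            // ... writer.addDocument(...) calls go here ...
            writer.close();
          }
        }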

      Attachments

      1. index.presharedstores.cfs.zip
        2 kB
        Michael McCandless
      2. index.presharedstores.nocfs.zip
        5 kB
        Michael McCandless
      3. LUCENE-843.patch
        141 kB
        Michael McCandless
      4. LUCENE-843.take2.patch
        148 kB
        Michael McCandless
      5. LUCENE-843.take3.patch
        156 kB
        Michael McCandless
      6. LUCENE-843.take4.patch
        188 kB
        Michael McCandless
      7. LUCENE-843.take5.patch
        239 kB
        Michael McCandless
      8. LUCENE-843.take6.patch
        210 kB
        Michael McCandless
      9. LUCENE-843.take7.patch
        189 kB
        Michael McCandless
      10. LUCENE-843.take8.patch
        203 kB
        Michael McCandless
      11. LUCENE-843.take9.patch
        204 kB
        Michael McCandless

          Activity

          Michael McCandless created issue -
          Michael McCandless made changes -
          Status: Open [ 1 ] -> In Progress [ 3 ]
          Michael McCandless added a comment -

          I'm attaching a patch with my current state. NOTE: this is very rough
          and very much a work in progress and nowhere near ready to commit! I
          wanted to get it out there sooner rather than later to get feedback,
          maybe entice some daring early adopters, iterate, etc.

          It passes all unit tests except the disk-full tests.

          There are some big issues yet to resolve:

          • Merge policy has problems when you "flush by RAM" (this is true
            even before my patch). Not sure how to fix yet.
          • Thread safety and thread concurrency aren't there yet.
          • Norms are not flushed (just use up RAM until you close the
            writer).
          • Many other things on my TODO list
          Michael McCandless made changes -
          Attachment LUCENE-843.patch [ 12353973 ]
          Michael McCandless added a comment -

          New rev of the patch:

          • Fixed at least one data corruption case
          • Added more asserts (run with "java -ea" so asserts run)
          • Some more small optimizations
          • Updated to current trunk so patch applies cleanly
          Michael McCandless made changes -
          Attachment LUCENE-843.take2.patch [ 12354163 ]
          Michael McCandless added a comment -

          Another rev of the patch:

          • Got thread concurrency working: removed "synchronized" from the
            entire call to MultiDocWriter.addDocument and instead synchronize
            two quick steps (init/finish) of addDocument, leaving the real
            work (processDocument) unsynchronized (see the sketch after this
            list).
          • Fixed a bug that was failing to delete temp files from the index
          • Reduced memory usage of Posting by inlining positions, start
            offset, end offset into a single int array.
          • Enabled IndexLineFiles.java (tool I use for local benchmarking) to
            run multiple threads
          • Other small optimizations
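
          To make the init/finish idea concrete, here is a minimal,
          hypothetical sketch of that locking pattern (names are illustrative
          stand-ins, not the patch code): only the quick bookkeeping steps
          hold the lock, while the expensive per-document work runs
          concurrently.

            import java.util.concurrent.atomic.AtomicInteger;

            class DocWriterConcurrencySketch {
              private final AtomicInteger nextDocID = new AtomicInteger();

              public void addDocument(String docText) {
                int docID;
                synchronized (this) {   // quick "init": hand out the doc's id / per-thread state
                  docID = nextDocID.getAndIncrement();
                }
                // slow part: analysis + postings building, deliberately unsynchronized
                long[] postings = processDocument(docText);
                synchronized (this) {   // quick "finish": publish the result, maybe trigger a flush
                  finishDocument(docID, postings);
                }
              }

              private long[] processDocument(String docText) {
                return new long[] { docText.length() };  // stand-in for the real work
              }

              private void finishDocument(int docID, long[] postings) {
                // stand-in for appending the doc's postings in docID order
              }
            }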

          BTW, one of the nice side effects of this patch is that it cleans up
          the mergeSegments method of IndexWriter: the "flush" of added docs &
          deletions is separated out (since it's no longer a merge) from the
          "true" mergeSegments, whose purpose is then to merge disk segments.
          Previously mergeSegments was getting rather confusing with the
          different cases/combinations of added docs or not, deleted docs or
          not, any merges or not.

          Michael McCandless made changes -
          Attachment LUCENE-843.take3.patch [ 12354431 ]
          Michael McCandless added a comment -

          Another rev of the patch. All tests pass except disk full tests. The
          code is still rather "dirty" and not well commented.

          I think I'm close to finishing optimizing and now I will focus on
          error handling (eg disk full), adding some deeper unit tests, more
          testing on corner cases like massive docs or docs with massive terms,
          etc., flushing pending norms to disk, cleaning up / commenting the
          code and various other smaller items.

          Here are the changes in this rev:

          • A proposed backwards-compatible change to the Token API to also
            allow the term text to be delivered as a slice (offset & length)
            into a char[] array instead of a String. With an
            analyzer/tokenizer that takes advantage of this, this gave a
            decent performance gain in my local testing. I've created a
            SimpleSpaceAnalyzer that only splits words at the space character
            to test this.
          • Added more asserts (run java -ea to enable asserts). The asserts
            are quite useful and now often catch a bug I've introduced before
            the unit tests do.
          • Changed to custom int[] block buffering for postings to store
            freq, prox's and offsets. With this buffering we no longer have
            to double the size of int[] arrays while adding positions, nor do
            we have to copy ints whenever we need more space for these
            arrays. Instead I allocate larger slices out of the shared int[]
            arrays (see the sketch after this list). This reduces memory and
            improves performance.
          • Changed to custom char[] block buffering for postings to store
            term text. This also reduces memory and improves performance.
          • Changed to single file for RAM & flushed partial segments (was 3
            separate files before)
          • Changed how I merge flushed partial segments to match what's
            described in LUCENE-854
          • Reduced memory usage when indexing large docs (25 MB plain text
            each). I'm still consuming more RAM in this case than the
            baseline (trunk) so I'm still working on this one ...
          • Fixed a slow memory leak when building large (20+ GB) indices
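
          To illustrate the slice idea from the int[] block buffering bullet
          above, here is a much-simplified, hypothetical sketch (not the patch
          code): instead of doubling and copying a per-posting array, a full
          slice is chained to a larger slice allocated later in a shared
          block, with the old slice's last int reused as a forwarding pointer.

            class IntSlicePoolSketch {
              private final int[] block = new int[1 << 16]; // single shared block, for brevity
              private int used = 0;                         // next free position in the block

              // Allocate a slice of the given size; returns its start position.
              // The last int of the slice is a -1 sentinel (written values are
              // assumed non-negative, which holds for positions/offsets/freqs).
              int newSlice(int size) {
                int start = used;
                used += size;
                block[start + size - 1] = -1;
                return start;
              }

              // Write one value at 'pos'; returns the position for the next write.
              int write(int pos, int value, int nextSliceSize) {
                if (block[pos] == -1) {                   // slice is full: chain a larger slice
                  int next = newSlice(nextSliceSize);
                  block[pos] = next;                      // forwarding pointer, no copying of data
                  pos = next;
                }
                block[pos] = value;
                return pos + 1;
              }
            }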
          Michael McCandless made changes -
          Attachment LUCENE-843.take4.patch [ 12354752 ]
          Michael McCandless added a comment -

          Some details on how I measure RAM usage: both the baseline (current
          lucene trunk) and my patch have two general classes of RAM usage.

          The first class, "document processing RAM", is RAM used while
          processing a single doc. This RAM is re-used for each document (in the
          trunk, it's GC'd and new RAM is allocated; in my patch, I explicitly
          re-use these objects) and how large it gets is driven by how big each
          document is.

          The second class, "indexed documents RAM", is the RAM used up by
          previously indexed documents. This RAM grows with each added
          document and how large it gets is driven by the number and size of
          docs indexed since the last flush.

          So when I say the writer is allowed to use 32 MB of RAM, I'm only
          measuring the "indexed documents RAM". With trunk I do this by
          calling ramSizeInBytes(), and with my patch I do the analogous thing
          by measuring how many RAM buffers are held up storing previously
          indexed documents.

          I then define "RAM efficiency" (docs/MB) as how many docs we can hold
          in "indexed documents RAM" per MB RAM, at the point that we flush to
          disk. I think this is an important metric because it drives how large
          your initial (level 0) segments are. The larger these segments are
          then generally the less merging you need to do, for a given # docs in
          the index.

          I also measure overall RAM used in the JVM (using
          MemoryMXBean.getHeapMemoryUsage().getUsed()) just prior to each flush
          except the last, to also capture the "document processing RAM", object
          overhead, etc.
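
          For reference, a tiny sketch of the heap measurement itself (plain
          JDK calls, nothing patch-specific), plus the docs/MB arithmetic:

            import java.lang.management.ManagementFactory;

            class RamEfficiencySketch {
              static double heapUsedMB() {
                return ManagementFactory.getMemoryMXBean()
                    .getHeapMemoryUsage().getUsed() / (1024.0 * 1024.0);
              }

              // "RAM efficiency": docs held in the indexed-documents RAM per MB,
              // sampled at the point the writer flushes to disk.
              static double docsPerMB(int docsBufferedAtFlush, double ramBufferMB) {
                return docsBufferedAtFlush / ramBufferMB;
              }

              public static void main(String[] args) {
                System.out.println("heap used (MB) just before flush: " + heapUsedMB());
                // e.g. 1571 buffered docs flushed with a 32 MB buffer -> ~49.1 docs/MB
                System.out.println("docs/MB @ flush: " + docsPerMB(1571, 32.0));
              }
            }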

          Michael McCandless added a comment -

          To do the benchmarking I created a simple standalone tool
          (demo/IndexLineFiles, in the last patch) that indexes one line at a
          time from a large previously created file, optionally using multiple
          threads. I do it this way to minimize IO cost of pulling the document
          source because I want to measure just indexing time as much as possible.

          Each line is read and a doc is created with field "contents" that is
          not stored, is tokenized, and optionally has term vectors with
          position+offsets. I also optionally add two small only-stored fields
          ("path" and "modified"). I think these are fairly trivial documents
          compared to typical usage of Lucene.
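
          Roughly, each per-line document looks like the following (a sketch
          against the Lucene 2.x field API, not the actual demo/IndexLineFiles
          code; SimpleSpaceAnalyzer comes with the patch, so the stock
          WhitespaceAnalyzer stands in here):

            import java.io.BufferedReader;
            import java.io.FileReader;
            import org.apache.lucene.analysis.WhitespaceAnalyzer;
            import org.apache.lucene.document.Document;
            import org.apache.lucene.document.Field;
            import org.apache.lucene.index.IndexWriter;

            public class IndexLineFilesSketch {
              public static void main(String[] args) throws Exception {
                IndexWriter writer =
                    new IndexWriter("/tmp/index", new WhitespaceAnalyzer(), true);
                BufferedReader in = new BufferedReader(new FileReader(args[0]));
                String line;
                while ((line = in.readLine()) != null) {
                  Document doc = new Document();
                  // tokenized, not stored; term vectors (positions + offsets) are optional
                  doc.add(new Field("contents", line, Field.Store.NO,
                                    Field.Index.TOKENIZED,
                                    Field.TermVector.WITH_POSITIONS_OFFSETS));
                  // the two small stored-only fields are also optional
                  doc.add(new Field("path", args[0], Field.Store.YES, Field.Index.NO));
                  doc.add(new Field("modified", "0", Field.Store.YES, Field.Index.NO));
                  writer.addDocument(doc);
                }
                in.close();
                writer.close();
              }
            }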

          For the corpus, I took Europarl's "en" content, stripped tags, and
          processed it into 3 files: one with 100 tokens (~550 bytes of plain
          text) per line, one with 1000 tokens (~5,500 bytes) per line, and
          one with 10000 tokens (~55,000 bytes) per line.

          All settings (mergeFactor, compound file, etc.) are left at defaults.
          I don't optimize the index in the end. I'm using my new
          SimpleSpaceAnalyzer (which just splits tokens on the space character
          and creates the token text as a slice into a char[] array instead of
          a new String(...)) to minimize the cost of tokenization.

          I ran the tests with Java 1.5 on a Mac Pro quad (2 Intel CPUs, each
          dual core) OS X box with 2 GB RAM. I give java 1 GB heap (-Xmx1024m).

          Michael McCandless added a comment -

          A couple more details on the testing: I run java -server to get all
          optimizations in the JVM, and the IO system is a local OS X RAID 0 of
          4 SATA drives.

          Using the above tool I ran an initial set of benchmarks comparing old
          (= Lucene trunk) vs new (= this patch), varying document size (~550
          bytes to ~5,500 bytes to ~55,000 bytes of plain text from Europarl
          "en").

          For each document size I run 4 combinations of whether term vectors
          and stored fields are on or off and whether autoCommit is true or
          false. I measure net docs/sec (= total # docs indexed divided by
          total time taken), RAM efficiency (= avg # docs flushed with each
          flush divided by RAM buffer size), and avg HEAP RAM usage before each
          flush.
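
          To make the derived figures easy to check, here is how the first
          10K-token result pair reported below reduces to the printed numbers
          (plain arithmetic, nothing patch-specific):

            public class MetricsExample {
              public static void main(String[] args) {
                int docs = 20000;
                double oldSecs = 200.3, newSecs = 126.0;
                double oldRate = docs / oldSecs;                    // ~99.8 docs/sec
                double newRate = docs / newSecs;                    // ~158.7 docs/sec
                double pctFaster = 100.0 * (newRate / oldRate - 1); // ~59.0% faster
                System.out.printf("old %.1f, new %.1f docs/sec (%.1f%% faster)%n",
                                  oldRate, newRate, pctFaster);
              }
            }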

          Here are the results for the 10K tokens (= ~55,000 bytes plain text)
          per document:

          20000 DOCS @ ~55,000 bytes plain text
          RAM = 32 MB
          NUM THREADS = 1
          MERGE FACTOR = 10

          No term vectors nor stored fields

          AUTOCOMMIT = true (commit whenever RAM is full)

          old
          20000 docs in 200.3 secs
          index size = 358M

          new
          20000 docs in 126.0 secs
          index size = 356M

          Total Docs/sec: old 99.8; new 158.7 [ 59.0% faster]
          Docs/MB @ flush: old 24.2; new 49.1 [ 102.5% more]
          Avg RAM used (MB) @ flush: old 74.5; new 36.2 [ 51.4% less]

          AUTOCOMMIT = false (commit only once at the end)

          old
          20000 docs in 202.7 secs
          index size = 358M

          new
          20000 docs in 120.0 secs
          index size = 354M

          Total Docs/sec: old 98.7; new 166.7 [ 69.0% faster]
          Docs/MB @ flush: old 24.2; new 48.9 [ 101.7% more]
          Avg RAM used (MB) @ flush: old 74.3; new 37.0 [ 50.2% less]

          With term vectors (positions + offsets) and 2 small stored fields

          AUTOCOMMIT = true (commit whenever RAM is full)

          old
          20000 docs in 374.7 secs
          index size = 1.4G

          new
          20000 docs in 236.1 secs
          index size = 1.4G

          Total Docs/sec: old 53.4; new 84.7 [ 58.7% faster]
          Docs/MB @ flush: old 10.2; new 49.1 [ 382.8% more]
          Avg RAM used (MB) @ flush: old 129.3; new 36.6 [ 71.7% less]

          AUTOCOMMIT = false (commit only once at the end)

          old
          20000 docs in 385.7 secs
          index size = 1.4G

          new
          20000 docs in 182.8 secs
          index size = 1.4G

          Total Docs/sec: old 51.9; new 109.4 [ 111.0% faster]
          Docs/MB @ flush: old 10.2; new 48.9 [ 380.9% more]
          Avg RAM used (MB) @ flush: old 76.0; new 37.3 [ 50.9% less]

          Michael McCandless added a comment -

          Here are the results for "normal" sized docs (1K tokens = ~5,500 bytes plain text each):

          200000 DOCS @ ~5,500 bytes plain text
          RAM = 32 MB
          NUM THREADS = 1
          MERGE FACTOR = 10

          No term vectors nor stored fields

          AUTOCOMMIT = true (commit whenever RAM is full)

          old
          200000 docs in 397.6 secs
          index size = 415M

          new
          200000 docs in 167.5 secs
          index size = 411M

          Total Docs/sec: old 503.1; new 1194.1 [ 137.3% faster]
          Docs/MB @ flush: old 81.6; new 406.2 [ 397.6% more]
          Avg RAM used (MB) @ flush: old 87.3; new 35.2 [ 59.7% less]

          AUTOCOMMIT = false (commit only once at the end)

          old
          200000 docs in 394.6 secs
          index size = 415M

          new
          200000 docs in 168.4 secs
          index size = 408M

          Total Docs/sec: old 506.9; new 1187.7 [ 134.3% faster]
          Docs/MB @ flush: old 81.6; new 432.2 [ 429.4% more]
          Avg RAM used (MB) @ flush: old 126.6; new 36.9 [ 70.8% less]

          With term vectors (positions + offsets) and 2 small stored fields

          AUTOCOMMIT = true (commit whenever RAM is full)

          old
          200000 docs in 754.2 secs
          index size = 1.7G

          new
          200000 docs in 304.9 secs
          index size = 1.7G

          Total Docs/sec: old 265.2; new 656.0 [ 147.4% faster]
          Docs/MB @ flush: old 46.7; new 406.2 [ 769.6% more]
          Avg RAM used (MB) @ flush: old 92.9; new 35.2 [ 62.1% less]

          AUTOCOMMIT = false (commit only once at the end)

          old
          200000 docs in 743.9 secs
          index size = 1.7G

          new
          200000 docs in 244.3 secs
          index size = 1.7G

          Total Docs/sec: old 268.9; new 818.7 [ 204.5% faster]
          Docs/MB @ flush: old 46.7; new 432.2 [ 825.2% more]
          Avg RAM used (MB) @ flush: old 93.0; new 36.6 [ 60.6% less]

          Michael McCandless added a comment -

          Last is the results for small docs (100 tokens = ~550 bytes plain text each):

          2000000 DOCS @ ~550 bytes plain text
          RAM = 32 MB
          NUM THREADS = 1
          MERGE FACTOR = 10

          No term vectors nor stored fields

          AUTOCOMMIT = true (commit whenever RAM is full)

          old
          2000000 docs in 886.7 secs
          index size = 438M

          new
          2000000 docs in 230.5 secs
          index size = 435M

          Total Docs/sec: old 2255.6; new 8676.4 [ 284.7% faster]
          Docs/MB @ flush: old 128.0; new 4194.6 [ 3176.2% more]
          Avg RAM used (MB) @ flush: old 107.3; new 37.7 [ 64.9% less]

          AUTOCOMMIT = false (commit only once at the end)

          old
          2000000 docs in 888.7 secs
          index size = 438M

          new
          2000000 docs in 239.6 secs
          index size = 432M

          Total Docs/sec: old 2250.5; new 8348.7 [ 271.0% faster]
          Docs/MB @ flush: old 128.0; new 4146.8 [ 3138.9% more]
          Avg RAM used (MB) @ flush: old 108.1; new 38.9 [ 64.0% less]

          With term vectors (positions + offsets) and 2 small stored fields

          AUTOCOMMIT = true (commit whenever RAM is full)

          old
          2000000 docs in 1480.1 secs
          index size = 2.1G

          new
          2000000 docs in 462.0 secs
          index size = 2.1G

          Total Docs/sec: old 1351.2; new 4329.3 [ 220.4% faster]
          Docs/MB @ flush: old 93.1; new 4194.6 [ 4405.7% more]
          Avg RAM used (MB) @ flush: old 296.4; new 38.3 [ 87.1% less]

          AUTOCOMMIT = false (commit only once at the end)

          old
          2000000 docs in 1489.4 secs
          index size = 2.1G

          new
          2000000 docs in 347.9 secs
          index size = 2.1G

          Total Docs/sec: old 1342.8; new 5749.4 [ 328.2% faster]
          Docs/MB @ flush: old 93.1; new 4146.8 [ 4354.5% more]
          Avg RAM used (MB) @ flush: old 297.1; new 38.6 [ 87.0% less]

          Michael McCandless added a comment -

          A few notes from these results:

          • A real Lucene app won't see gains this large because the retrieval
            of docs from the content source, and the tokenization, frequently
            take substantial amounts of time, whereas for this test I've
            intentionally kept the cost of those steps very low: I'm 1)
            pulling one line at a time from a big text file, and 2) using my
            simplistic SimpleSpaceAnalyzer which just breaks tokens at the
            space character.
          • Best speedup is ~4.3X faster, for tiny docs (~550 bytes) with term
            vectors and stored fields enabled and using autoCommit=false.
          • Least speedup is still ~1.6X faster, for large docs (~55,000
            bytes) with autoCommit=true.
          • The autoCommit=false cases are a little unfair to the new patch
            because with the new patch, you get a single-segment (optimized)
            index in the end, but with existing Lucene trunk, you don't.
          • With term vectors and/or stored fields, autoCommit=false is quite
            a bit faster with the patch, because we never pay the price to
            merge them since they are written once.
          • With term vectors and/or stored fields, the new patch has
            substantially better RAM efficiency.
          • The patch is especially faster and has better RAM efficiency with
            smaller documents.
          • The actual HEAP RAM usage is quite a bit more stable with the
            patch, especially with term vectors & stored fields enabled. I
            think this is because the patch creates far less garbage for GC to
            periodically reclaim. I think this also means you could push your
            RAM buffer size even higher to get better performance.
          Marvin Humphrey added a comment -

          > The actual HEAP RAM usage is quite a bit more
          > stable with the patch, especially with term vectors
          > & stored fields enabled. I think this is because the
          > patch creates far less garbage for GC to periodically
          > reclaim. I think this also means you could push your
          > RAM buffer size even higher to get better performance.

          For KinoSearch, the sweet spot seems to be a buffer of around 16 MB when benchmarking with the Reuters corpus on my G4 laptop. Larger than that and things actually slow down, unless the buffer is large enough that it never needs flushing. My hypothesis is that RAM fragmentation is slowing down malloc/free. I'll be interested as to whether you see the same effect.

          Michael McCandless added a comment -

          >> The actual HEAP RAM usage is quite a bit more
          >> stable with the patch, especially with term vectors
          >> & stored fields enabled. I think this is because the
          >> patch creates far less garbage for GC to periodically
          >> reclaim. I think this also means you could push your
          >> RAM buffer size even higher to get better performance.
          >
          > For KinoSearch, the sweet spot seems to be a buffer of around 16 MB
          > when benchmarking with the Reuters corpus on my G4 laptop. Larger
          > than that and things actually slow down, unless the buffer is large
          > enough that it never needs flushing. My hypothesis is that RAM
          > fragmentation is slowing down malloc/free. I'll be interested as to
          > whether you see the same effect.

          Interesting. OK I will run the benchmark across increasing RAM sizes
          to see where the sweet spot seems to be!

          Michael McCandless added a comment -

          OK I ran old (trunk) vs new (this patch) with increasing RAM buffer
          sizes up to 96 MB.

          I used the "normal" sized docs (~5,500 bytes plain text), left stored
          fields and term vectors (positions + offsets) on, and
          autoCommit=false.

          Here're the results:

          NUM THREADS = 1
          MERGE FACTOR = 10
          With term vectors (positions + offsets) and 2 small stored fields
          AUTOCOMMIT = false (commit only once at the end)

          1 MB

          old
          200000 docs in 862.2 secs
          index size = 1.7G

          new
          200000 docs in 297.1 secs
          index size = 1.7G

          Total Docs/sec: old 232.0; new 673.2 [ 190.2% faster]
          Docs/MB @ flush: old 47.2; new 278.4 [ 489.6% more]
          Avg RAM used (MB) @ flush: old 34.5; new 3.4 [ 90.1% less]

          2 MB

          old
          200000 docs in 828.7 secs
          index size = 1.7G

          new
          200000 docs in 279.0 secs
          index size = 1.7G

          Total Docs/sec: old 241.3; new 716.8 [ 197.0% faster]
          Docs/MB @ flush: old 47.0; new 322.4 [ 586.7% more]
          Avg RAM used (MB) @ flush: old 37.9; new 4.5 [ 88.0% less]

          4 MB

          old
          200000 docs in 840.5 secs
          index size = 1.7G

          new
          200000 docs in 260.8 secs
          index size = 1.7G

          Total Docs/sec: old 237.9; new 767.0 [ 222.3% faster]
          Docs/MB @ flush: old 46.8; new 363.1 [ 675.4% more]
          Avg RAM used (MB) @ flush: old 33.9; new 6.5 [ 80.9% less]

          8 MB

          old
          200000 docs in 678.8 secs
          index size = 1.7G

          new
          200000 docs in 248.8 secs
          index size = 1.7G

          Total Docs/sec: old 294.6; new 803.7 [ 172.8% faster]
          Docs/MB @ flush: old 46.8; new 392.4 [ 739.1% more]
          Avg RAM used (MB) @ flush: old 60.3; new 10.7 [ 82.2% less]

          16 MB

          old
          200000 docs in 660.6 secs
          index size = 1.7G

          new
          200000 docs in 247.3 secs
          index size = 1.7G

          Total Docs/sec: old 302.8; new 808.7 [ 167.1% faster]
          Docs/MB @ flush: old 46.7; new 415.4 [ 788.8% more]
          Avg RAM used (MB) @ flush: old 47.1; new 19.2 [ 59.3% less]

          24 MB

          old
          200000 docs in 658.1 secs
          index size = 1.7G

          new
          200000 docs in 243.0 secs
          index size = 1.7G

          Total Docs/sec: old 303.9; new 823.0 [ 170.8% faster]
          Docs/MB @ flush: old 46.7; new 430.9 [ 822.2% more]
          Avg RAM used (MB) @ flush: old 70.0; new 27.5 [ 60.8% less]

          32 MB

          old
          200000 docs in 714.2 secs
          index size = 1.7G

          new
          200000 docs in 239.2 secs
          index size = 1.7G

          Total Docs/sec: old 280.0; new 836.0 [ 198.5% faster]
          Docs/MB @ flush: old 46.7; new 432.2 [ 825.2% more]
          Avg RAM used (MB) @ flush: old 92.5; new 36.7 [ 60.3% less]

          48 MB

          old
          200000 docs in 640.3 secs
          index size = 1.7G

          new
          200000 docs in 236.0 secs
          index size = 1.7G

          Total Docs/sec: old 312.4; new 847.5 [ 171.3% faster]
          Docs/MB @ flush: old 46.7; new 438.5 [ 838.8% more]
          Avg RAM used (MB) @ flush: old 138.9; new 52.8 [ 62.0% less]

          64 MB

          old
          200000 docs in 649.3 secs
          index size = 1.7G

          new
          200000 docs in 238.3 secs
          index size = 1.7G

          Total Docs/sec: old 308.0; new 839.3 [ 172.5% faster]
          Docs/MB @ flush: old 46.7; new 441.3 [ 844.7% more]
          Avg RAM used (MB) @ flush: old 302.6; new 72.7 [ 76.0% less]

          80 MB

          old
          200000 docs in 670.2 secs
          index size = 1.7G

          new
          200000 docs in 227.2 secs
          index size = 1.7G

          Total Docs/sec: old 298.4; new 880.5 [ 195.0% faster]
          Docs/MB @ flush: old 46.7; new 446.2 [ 855.2% more]
          Avg RAM used (MB) @ flush: old 231.7; new 94.3 [ 59.3% less]

          96 MB

          old
          200000 docs in 683.4 secs
          index size = 1.7G

          new
          200000 docs in 226.8 secs
          index size = 1.7G

          Total Docs/sec: old 292.7; new 882.0 [ 201.4% faster]
          Docs/MB @ flush: old 46.7; new 448.0 [ 859.1% more]
          Avg RAM used (MB) @ flush: old 274.5; new 112.7 [ 59.0% less]

          Some observations:

          • Remember the test is already biased against "new" because with the
            patch you get an optimized index in the end but with "old" you
            don't.
          • Sweet spot for old (trunk) seems to be 48 MB: that is the peak
            docs/sec @ 312.4.
          • New (with patch) seems to just get faster the more memory you give
            it, though gradually. The peak was 96 MB (the largest I ran). So
            no sweet spot (or maybe I need to give more memory, but, above 96
            MB the trunk was starting to swap on my test env).
          • New gets better and better RAM efficiency, the more RAM you give.
            This makes sense: it's better able to compress the terms dict, the
            more docs are merged in RAM before having to flush to disk. I
            would also expect this curve to be somewhat content dependent.
          Michael McCandless added a comment -

          I attached a new iteration of the patch. It's quite different from
          the last patch.

          After discussion on java-dev last time, I decided to retry the
          "persistent hash" approach, where the Postings hash lasts across many
          docs and then a single flush produces a partial segment containing all
          of those docs. This is in contrast to the previous approach where
          each doc makes its own segment and then they are merged.

          It turns out this is even faster than my previous approach, especially
          for smaller docs and especially when term vectors are off (because no
          quicksort() is needed until the segment is flushed). I will attach
          new benchmark results.

          Other changes:

          • Changed my benchmarking tool / testing (IndexLineFiles):
            • I turned off compound file (to reduce time NOT spent on
              indexing).
            • I noticed I was not downcasing the terms, so I fixed that.
            • I now do my own line processing to reduce the GC cost of
              "BufferedReader.readLine" (to reduce time NOT spent on
              indexing).
          • Norms now properly flush to disk in the autoCommit=false case
          • All unit tests pass except disk full
          • I turned on asserts for unit tests (jvm arg -ea added to the junit
            ant task). I think we should use asserts when running tests. I
            have quite a few asserts now.

          With this new approach, as I process each term in the document I
          immediately write the prox/freq in their compact (vints) format into
          shared byte[] buffers, rather than accumulating int[] arrays that then
          need to be re-processed into the vint encoding. This speeds things up
          because we don't double-process the postings. It also uses less
          per-document RAM overhead because intermediate postings are stored as
          vints not as ints.

          When enough RAM is used by the Posting entries plus the byte[]
          buffers, I flush them to a partial RAM segment. When enough of these
          RAM segments have accumulated I flush to a real Lucene segment
          (autoCommit=true) or to on-disk partial segments (autoCommit=false)
          which are then merged in the end to create a real Lucene segment.

          Michael McCandless made changes -
          Attachment LUCENE-843.take5.patch [ 12356500 ]
          Marvin Humphrey added a comment -

          How are you writing the frq data in compressed format? That works
          fine for prx data, because the deltas are all within a single doc –
          but for the freq data, the deltas are tied up in doc num deltas, so
          you have to decompress it when performing merges.

          To continue our discussion from java-dev...

          • I haven't been able to come up with a file format tweak that
            gets around this doc-num-delta-decompression problem to enhance the speed
            of frq data merging. I toyed with splitting off the freq from the
            doc_delta, at the price of increasing the file size in the common case of
            freq == 1, but went back to the old design. It's not worth the size
            increase for what's at best a minor indexing speedup.
          • I've added a custom MemoryPool class to KS which grabs memory in 1 meg
            chunks, allows resizing (downwards) of only the last allocation, and can
            only release everything at once. From one of these pools, I'm allocating
            RawPosting objects, each of which is a doc_num, a freq, the term_text, and
            the pre-packed prx data (which varies based on which Posting subclass
            created the RawPosting object). I haven't got things 100% stable yet, but
            preliminary results seem to indicate that this technique, which is a riff
            on your persistent arrays, improves indexing speed by about 15%.
          Michael McCandless added a comment -

          > How are you writing the frq data in compressed format? That works fine for
          > prx data, because the deltas are all within a single doc – but for the freq
          > data, the deltas are tied up in doc num deltas, so you have to decompress it
          > when performing merges.

          For each Posting I keep track of the last docID that its term occurred
          in; when this differs from the current docID I record the "delta code"
          that needs to be written and then I later write it with the final freq
          for this document.
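
          A minimal sketch of the encoding being described, assuming an
          IndexOutput-style vInt stream (the patch itself writes into shared
          byte[] slices through its own writeFreqVInt-style helpers, but the
          docDelta/freq scheme is the same one used by the .frq file format):

          import java.io.IOException;
          import org.apache.lucene.store.IndexOutput;

          class FreqEncodingSketch {
            // When a term is next seen in a *different* document, the previous
            // document's (docDelta, freq) pair is flushed to the freq stream: the
            // delta is shifted left one bit, and a set low bit means freq == 1,
            // so the freq itself can be omitted in that common case.
            static void writeDocAndFreq(IndexOutput freqOut, int lastDocID, int docID, int freq)
                throws IOException {
              int delta = docID - lastDocID;
              if (freq == 1) {
                freqOut.writeVInt((delta << 1) | 1);   // low bit set: freq omitted
              } else {
                freqOut.writeVInt(delta << 1);         // low bit clear: freq follows
                freqOut.writeVInt(freq);
              }
            }
          }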

          > * I haven't been able to come up with a file format tweak that
          > gets around this doc-num-delta-decompression problem to enhance the speed
          > of frq data merging. I toyed with splitting off the freq from the
          > doc_delta, at the price of increasing the file size in the common case of
          > freq == 1, but went back to the old design. It's not worth the size
          > increase for what's at best a minor indexing speedup.

          I'm just doing the "stitching" approach here: it's only the very first
          docCode (& freq when freq==1) that must be re-encoded on merging. The
          one catch is you must store the last docID of the previous segment so
          you can compute the new delta at the boundary. Then I do a raw
          "copyBytes" for the remainder of the freq postings.

          Note that I'm only doing this for the "internal" merges (of partial
          RAM segments and flushed partial segments) I do before creating a real
          Lucene segment. I haven't changed how the "normal" Lucene segment
          merging works (though I think we should look into it – I opened a
          separate issue): it still re-interprets and then re-encodes all
          docID/freq's.

          > * I've added a custom MemoryPool class to KS which grabs memory in 1 meg
          > chunks, allows resizing (downwards) of only the last allocation, and can
          > only release everything at once. From one of these pools, I'm allocating
          > RawPosting objects, each of which is a doc_num, a freq, the term_text, and
          > the pre-packed prx data (which varies based on which Posting subclass
          > created the RawPosting object). I haven't got things 100% stable yet, but
          > preliminary results seem to indicate that this technique, which is a riff
          > on your persistent arrays, improves indexing speed by about 15%.

          Fabulous!!

          I think it's the custom memory management I'm doing with slices into
          shared byte[] arrays for the postings that made the persistent hash
          approach work well, this time around (when I had previously tried this
          it was slower).

          Michael McCandless added a comment -

          Results with the above patch:

          RAM = 32 MB
          NUM THREADS = 1
          MERGE FACTOR = 10

          2000000 DOCS @ ~550 bytes plain text

          No term vectors nor stored fields

          AUTOCOMMIT = true (commit whenever RAM is full)

          old
          2000000 docs in 782.8 secs
          index size = 436M

          new
          2000000 docs in 93.4 secs
          index size = 430M

          Total Docs/sec: old 2554.8; new 21421.1 [ 738.5% faster]
          Docs/MB @ flush: old 128.0; new 4058.0 [ 3069.6% more]
          Avg RAM used (MB) @ flush: old 140.2; new 38.0 [ 72.9% less]

          AUTOCOMMIT = false (commit only once at the end)

          old
          2000000 docs in 780.2 secs
          index size = 436M

          new
          2000000 docs in 90.6 secs
          index size = 427M

          Total Docs/sec: old 2563.3; new 22086.8 [ 761.7% faster]
          Docs/MB @ flush: old 128.0; new 4118.4 [ 3116.7% more]
          Avg RAM used (MB) @ flush: old 144.6; new 36.4 [ 74.8% less]

          With term vectors (positions + offsets) and 2 small stored fields

          AUTOCOMMIT = true (commit whenever RAM is full)

          old
          2000000 docs in 1227.6 secs
          index size = 2.1G

          new
          2000000 docs in 559.8 secs
          index size = 2.1G

          Total Docs/sec: old 1629.2; new 3572.5 [ 119.3% faster]
          Docs/MB @ flush: old 93.1; new 4058.0 [ 4259.1% more]
          Avg RAM used (MB) @ flush: old 193.4; new 38.5 [ 80.1% less]

          AUTOCOMMIT = false (commit only once at the end)

          old
          2000000 docs in 1229.2 secs
          index size = 2.1G

          new
          2000000 docs in 241.0 secs
          index size = 2.1G

          Total Docs/sec: old 1627.0; new 8300.0 [ 410.1% faster]
          Docs/MB @ flush: old 93.1; new 4118.4 [ 4323.9% more]
          Avg RAM used (MB) @ flush: old 150.5; new 38.4 [ 74.5% less]

          200000 DOCS @ ~5,500 bytes plain text

          No term vectors nor stored fields

          AUTOCOMMIT = true (commit whenever RAM is full)

          old
          200000 docs in 352.2 secs
          index size = 406M

          new
          200000 docs in 86.4 secs
          index size = 403M

          Total Docs/sec: old 567.9; new 2313.7 [ 307.4% faster]
          Docs/MB @ flush: old 83.5; new 420.0 [ 402.7% more]
          Avg RAM used (MB) @ flush: old 76.8; new 38.1 [ 50.4% less]

          AUTOCOMMIT = false (commit only once at the end)

          old
          200000 docs in 399.2 secs
          index size = 406M

          new
          200000 docs in 89.6 secs
          index size = 400M

          Total Docs/sec: old 501.0; new 2231.0 [ 345.3% faster]
          Docs/MB @ flush: old 83.5; new 422.6 [ 405.8% more]
          Avg RAM used (MB) @ flush: old 76.7; new 36.2 [ 52.7% less]

          With term vectors (positions + offsets) and 2 small stored fields

          AUTOCOMMIT = true (commit whenever RAM is full)

          old
          200000 docs in 594.2 secs
          index size = 1.7G

          new
          200000 docs in 229.0 secs
          index size = 1.7G

          Total Docs/sec: old 336.6; new 873.3 [ 159.5% faster]
          Docs/MB @ flush: old 47.9; new 420.0 [ 776.9% more]
          Avg RAM used (MB) @ flush: old 157.8; new 38.0 [ 75.9% less]

          AUTOCOMMIT = false (commit only once at the end)

          old
          200000 docs in 605.1 secs
          index size = 1.7G

          new
          200000 docs in 181.3 secs
          index size = 1.7G

          Total Docs/sec: old 330.5; new 1103.1 [ 233.7% faster]
          Docs/MB @ flush: old 47.9; new 422.6 [ 782.2% more]
          Avg RAM used (MB) @ flush: old 132.0; new 37.1 [ 71.9% less]

          20000 DOCS @ ~55,000 bytes plain text

          No term vectors nor stored fields

          AUTOCOMMIT = true (commit whenever RAM is full)

          old
          20000 docs in 180.8 secs
          index size = 350M

          new
          20000 docs in 79.1 secs
          index size = 349M

          Total Docs/sec: old 110.6; new 252.8 [ 128.5% faster]
          Docs/MB @ flush: old 25.0; new 46.8 [ 87.0% more]
          Avg RAM used (MB) @ flush: old 112.2; new 44.3 [ 60.5% less]

          AUTOCOMMIT = false (commit only once at the end)

          old
          20000 docs in 180.1 secs
          index size = 350M

          new
          20000 docs in 75.9 secs
          index size = 347M

          Total Docs/sec: old 111.0; new 263.5 [ 137.3% faster]
          Docs/MB @ flush: old 25.0; new 47.5 [ 89.7% more]
          Avg RAM used (MB) @ flush: old 111.1; new 42.5 [ 61.7% less]

          With term vectors (positions + offsets) and 2 small stored fields

          AUTOCOMMIT = true (commit whenever RAM is full)

          old
          20000 docs in 323.1 secs
          index size = 1.4G

          new
          20000 docs in 183.9 secs
          index size = 1.4G

          Total Docs/sec: old 61.9; new 108.7 [ 75.7% faster]
          Docs/MB @ flush: old 10.4; new 46.8 [ 348.3% more]
          Avg RAM used (MB) @ flush: old 74.2; new 44.9 [ 39.5% less]

          AUTOCOMMIT = false (commit only once at the end)

          old
          20000 docs in 323.5 secs
          index size = 1.4G

          new
          20000 docs in 135.6 secs
          index size = 1.4G

          Total Docs/sec: old 61.8; new 147.5 [ 138.5% faster]
          Docs/MB @ flush: old 10.4; new 47.5 [ 354.8% more]
          Avg RAM used (MB) @ flush: old 74.3; new 42.9 [ 42.2% less]

          Yonik Seeley added a comment -

          How does this work with pending deletes?
          I assume that if autocommit is false, then you need to wait until the end when you get a real lucene segment to delete the pending terms?

          Also, how has the merge policy (or index invariants) of Lucene segments changed?
          If autocommit is off, then you wait until the end to create a big Lucene segment. This new segment may be much larger than the segments to its "left". I suppose the idea of merging rightmost segments should just be dropped in favor of merging the smallest adjacent segments? Sorry if this has already been covered... as I said, I'm trying to follow along at a high level.

          Michael McCandless added a comment -

          > How does this work with pending deletes?
          > I assume that if autocommit is false, then you need to wait until the end when you get a real lucene segment to delete the pending terms?

          Yes, all of this sits "below" the pending deletes layer since this
          change writes a single segment either when RAM is full
          (autoCommit=true) or when writer is closed (autoCommit=false). Then
          the deletes get applied like normal (I haven't changed that part).

          > Also, how has the merge policy (or index invariants) of Lucene segments changed?
          > If autocommit is off, then you wait until the end to create a big Lucene segment. This new segment may be much larger than the segments to its "left". I suppose the idea of merging rightmost segments should just be dropped in favor of merging the smallest adjacent segments? Sorry if this has already been covered... as I said, I'm trying to follow along at a high level.

          Has not been covered, and as usual these are excellent questions
          Yonik!

          I haven't yet changed anything about merge policy, but you're right
          the current invariants won't hold anymore. In fact they already don't
          hold if you "flush by RAM" now (APIs are exposed in 2.1 to let you do
          this). So we need to do something.

          I like your idea to relax merge policy (& invariants) to allow
          "merging of any adjacent segments" (not just rightmost ones) and then
          make the policy merge the smallest ones / most similarly sized ones,
          measuring size by net # bytes in the segment. This would preserve the
          "docID monotonicity invariance".

          If we take that approach then it would automatically resolve
          LUCENE-845 as well (which would otherwise block this issue).

          Michael McCandless made changes -
          Link This issue is blocked by LUCENE-845 [ LUCENE-845 ]
          Michael McCandless added a comment -

          Attached latest patch.

          I'm now working towards simplifying & cleaning up the code & design:
          eliminated dead code leftover from the previous iterations, use
          existing RAMFile instead of my own new class, refactored
          duplicate/confusing code, added comments, etc. It's getting closer to
          a committable state but still has a ways to go.

          I also renamed the class from MultiDocumentWriter to DocumentsWriter.

          To summarize the current design:

          1. Write stored fields & term vectors to files in the Directory
          immediately (don't buffer these in RAM).

          2. Write freq & prox postings to RAM directly as a byte stream
          instead of first pass as int[] and then second pass as a byte
          stream. This single-pass instead of double-pass is a big
          savings. I use slices into shared byte[] arrays to efficiently
          allocate bytes to the postings that need them.

          3. Build Postings hash that holds the Postings for many documents at
          once instead of a single doc, keyed by unique term. Not tearing
          down & rebuilding the Postings hash w/ every doc saves a lot of
          time. Also, when term vectors are off, this saves a quicksort for
          every doc, which gives a very good performance gain.

          When the Postings hash is full (used up the allowed RAM usage) I
          then create a real Lucene segment when autoCommit=true, else a
          "partial segment".

          4. Use my own "partial segment" format that differs from Lucene's
          normal segments in that it is optimized for merging (and unusable
          for searching). This format, and the merger I created to work
          with this format, performs merging mostly by copying blocks of
          bytes instead of reinterpreting every vInt in each Postings list.
          These partial segments are only created when IndexWriter has
          autoCommit=false, and then on commit they are merged into the
          real Lucene segment format.

          5. Reuse the Posting, PostingVector, char[] and byte[] objects that
          are used by the Postings hash.
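
          A much-simplified, self-contained illustration of the byte-slice idea
          in points 2 and 5: the names here are hypothetical, and the real
          DocumentsWriter chains fixed-size slices with forwarding addresses
          rather than copying on growth, but the overall shape is the same.

          import java.util.Arrays;

          class BytePoolSketch {
            private byte[] buffer = new byte[1 << 16];   // one shared backing array
            private int poolUpto;                        // next free byte in the pool

            /** Per-posting stream: postings append vInts here instead of owning arrays. */
            class Slice {
              int start = newSlice(16);                  // initial 16-byte slice
              int limit = start + 16;
              int upto = start;

              void writeByte(byte b) {
                if (upto == limit) {                     // slice full: move to a bigger one
                  int oldStart = start, oldLen = limit - start;
                  start = newSlice(2 * oldLen);
                  System.arraycopy(buffer, oldStart, buffer, start, oldLen);
                  limit = start + 2 * oldLen;
                  upto = start + oldLen;
                }
                buffer[upto++] = b;
              }

              void writeVInt(int i) {                    // 7 bits per byte, high bit = "more"
                while ((i & ~0x7F) != 0) {
                  writeByte((byte) ((i & 0x7F) | 0x80));
                  i >>>= 7;
                }
                writeByte((byte) i);
              }
            }

            private int newSlice(int size) {
              if (poolUpto + size > buffer.length) {
                buffer = Arrays.copyOf(buffer, 2 * buffer.length);
              }
              int sliceStart = poolUpto;
              poolUpto += size;
              return sliceStart;
            }
          }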

          I plan to keep simplifying the design & implementation. Specifically,
          I'm going to test removing #4 above entirely (using my own "partial
          segment" format that's optimized for merging not searching).

          While doing this may give back some of the performance gains, that
          code is the source of much added complexity in the patch, and, it
          duplicates the current SegmentMerger code. It was more necessary
          before (when we would merge thousands of single-doc segments in
          memory) but now that each segment contains many docs I think we are no
          longer gaining as much performance from it.

          I plan instead to write all segments in the "real" Lucene segment
          format and use the current SegmentMerger, possibly w/ some small
          changes, to do the merges even when autoCommit=false. Since we have
          another issue (LUCENE-856) to optimize segment merging I can carry
          over any optimizations that we may want to keep into that issue. If
          this doesn't lose much performance it will make the approach here even
          simpler.

          Michael McCandless made changes -
          Attachment LUCENE-843.take6.patch [ 12357792 ]
          Michael McCandless committed 544464 (1 file)
          Reviews: none

          LUCENE-843: making a small backwards compatible API change to return the newly added FieldInfo instance

          Michael McCandless added a comment -

          Latest working patch attached.

          I've cutover to using Lucene's normal segment merging for all merging
          (ie, I no longer use a different merge-efficient format for segments
          when autoCommit=false); this has substantially simplified the code.

          All unit tests pass except disk-full test and certain contrib tests
          (gdata-server, lucli, similarity, wordnet) that I think I'm not
          causing.

          Other changes:

          • Consolidated flushing of a new segment back into IndexWriter
            (previously DocumentsWriter would do its own flushing when
            autoCommit=false).

          I would also like to consolidate merging entirely into
          IndexWriter; right now DocumentsWriter does its own merging of the
          flushed segments when autoCommit=false (this is because those
          segments are "partial" meaning they do not have their own stored
          fields or term vectors). I'm trying to find a clean way to do
          this...

          • Thread concurrency now works: each thread writes into a separate
            Postings hash (up until a limit (currently 5) at which point the
            threads share the Postings hashes) and then when flushing the
            segment I merge the docIDs together. I flush when the total RAM
            used across threads is over the limit. I ran a test comparing
            thread concurrency on current trunk vs this patch, which I'll post
            next.
          • Reduced bytes used per-unique-term to be lower than current
            Lucene. This means the worst-case document (many terms, all of
            which are unique) should use less RAM overall than Lucene trunk
            does.
          • Added some new unit test cases; added missing "writer.close()" to
            one of the contrib tests.
          • Cleanup, comments, etc. I think the code is getting more
            "approachable" now.
          Michael McCandless made changes -
          Attachment LUCENE-843.take7.patch [ 12359276 ]
          Michael McCandless added a comment -

          I ran a benchmark using more than 1 thread to do indexing, in order to
          test & compare concurrency of trunk and the patch. The test is the
          same as above, and runs on a 4 core Mac Pro (OS X) box with 4 drive
          RAID 0 IO system.

          Here are the raw results:

          DOCS = ~5,500 bytes plain text
          RAM = 32 MB
          MERGE FACTOR = 10
          With term vectors (positions + offsets) and 2 small stored fields
          AUTOCOMMIT = false (commit only once at the end)

          NUM THREADS = 1

          new
          200000 docs in 172.3 secs
          index size = 1.7G

          old
          200000 docs in 539.5 secs
          index size = 1.7G

          Total Docs/sec: old 370.7; new 1161.0 [ 213.2% faster]
          Docs/MB @ flush: old 47.9; new 334.6 [ 598.7% more]
          Avg RAM used (MB) @ flush: old 131.9; new 33.1 [ 74.9% less]

          NUM THREADS = 2

          new
          200001 docs in 130.8 secs
          index size = 1.7G

          old
          200001 docs in 452.8 secs
          index size = 1.7G

          Total Docs/sec: old 441.7; new 1529.3 [ 246.2% faster]
          Docs/MB @ flush: old 47.9; new 301.5 [ 529.7% more]
          Avg RAM used (MB) @ flush: old 226.1; new 35.2 [ 84.4% less]

          NUM THREADS = 3

          new
          200002 docs in 105.4 secs
          index size = 1.7G

          old
          200002 docs in 428.4 secs
          index size = 1.7G

          Total Docs/sec: old 466.8; new 1897.9 [ 306.6% faster]
          Docs/MB @ flush: old 47.9; new 277.8 [ 480.2% more]
          Avg RAM used (MB) @ flush: old 289.8; new 37.0 [ 87.2% less]

          NUM THREADS = 4

          new
          200003 docs in 104.8 secs
          index size = 1.7G

          old
          200003 docs in 440.4 secs
          index size = 1.7G

          Total Docs/sec: old 454.1; new 1908.5 [ 320.3% faster]
          Docs/MB @ flush: old 47.9; new 259.9 [ 442.9% more]
          Avg RAM used (MB) @ flush: old 293.7; new 37.1 [ 87.3% less]

          NUM THREADS = 5

          new
          200004 docs in 99.5 secs
          index size = 1.7G

          old
          200004 docs in 425.0 secs
          index size = 1.7G

          Total Docs/sec: old 470.6; new 2010.5 [ 327.2% faster]
          Docs/MB @ flush: old 47.9; new 245.3 [ 412.6% more]
          Avg RAM used (MB) @ flush: old 390.9; new 38.3 [ 90.2% less]

          NUM THREADS = 6

          new
          200005 docs in 106.3 secs
          index size = 1.7G

          old
          200005 docs in 427.1 secs
          index size = 1.7G

          Total Docs/sec: old 468.2; new 1882.3 [ 302.0% faster]
          Docs/MB @ flush: old 47.8; new 248.5 [ 419.3% more]
          Avg RAM used (MB) @ flush: old 340.9; new 38.7 [ 88.6% less]

          NUM THREADS = 7

          new
          200006 docs in 106.1 secs
          index size = 1.7G

          old
          200006 docs in 435.2 secs
          index size = 1.7G

          Total Docs/sec: old 459.6; new 1885.3 [ 310.2% faster]
          Docs/MB @ flush: old 47.8; new 248.7 [ 420.0% more]
          Avg RAM used (MB) @ flush: old 408.6; new 39.1 [ 90.4% less]

          NUM THREADS = 8

          new
          200007 docs in 109.0 secs
          index size = 1.7G

          old
          200007 docs in 469.2 secs
          index size = 1.7G

          Total Docs/sec: old 426.3; new 1835.2 [ 330.5% faster]
          Docs/MB @ flush: old 47.8; new 251.3 [ 425.5% more]
          Avg RAM used (MB) @ flush: old 448.9; new 39.0 [ 91.3% less]

          Some quick comments:

          • Both trunk & the patch show speedups if you use more than 1 thread
            to do indexing. This is expected since the machine has concurrency.
          • The biggest speedup is from 1->2 threads but still good gains from
            2->5 threads.
          • Best seems to be 5 threads.
          • The patch allows better concurrency: relative to the trunk, its
            speedup grows as you add threads (the "% faster" figure increases
            with thread count). I think this makes sense because we flush less
            often with the patch, and flushing is time consuming and single
            threaded.
          Michael McCandless added a comment -

          Attached latest patch.

          I think this patch is ready to commit. I will let it sit for a while
          so people can review it.

          We still need to do LUCENE-845 before it can be committed as is.

          However one option instead would be to commit this patch, but leave
          IndexWriter flushing by doc count by default and then later switch it
          to flush by net RAM usage once LUCENE-845 is done. I like this option
          best.

          All tests pass (I've re-enabled the disk full tests and fixed error
          handling so they now pass) on Windows XP, Debian Linux and OS X.

          Summary of the changes in this rev:

          • Finished cleaning up & commenting code
          • Exception handling: if there is a disk full or any other exception
            while adding a document or flushing then the index is rolled back
            to the last commit point.
          • Added more unit tests
          • Removed my profiling tool from the patch (not intended to be
            committed)
          • Fixed a thread safety issue where, if you flush by doc count, you
            would sometimes get more docs at flush time than the count you
            requested. I moved the thread synchronization for determining
            flush time down into DocumentsWriter.
          • Also fixed thread safety of calling flush with one thread while
            other threads are still adding documents.
          • The biggest change is: absorbed all merging logic back into
            IndexWriter.

          Previously in DocumentsWriter I was tracking my own
          flushed/partial segments and merging them on my own (but using
          SegmentMerger). This makes DocumentsWriter much simpler: now its
          sole purpose is to gather added docs and write a new segment.

          This turns out to be a big win:

          • Code is much simpler (no duplication of "merging"
            policy/logic)
          • 21-25% additional performance gain for autoCommit=false case
            when stored fields & vectors are used
          • IndexWriter.close() no longer takes an unexpectedly long time to
            close in the autoCommit=false case

          However I had to make a change to the index format to do this.
          The basic idea is to allow multiple segments to share access to
          the "doc store" (stored fields, vectors) index files.

          The change is quite simple: FieldsReader/VectorsReader are now
          told the doc offset that they should start from when seeking in
          the index stream (this info is stored in SegmentInfo). When
          merging segments we don't merge the "doc store" files when all
          segments are sharing the same ones (big performance gain), else,
          we make a private copy of the "doc store" files (ie as segments
          normally are on the trunk today).

          The change is fully backwards compatible (I added a test case to
          the backwards compatibility unit test to be sure) and the change
          is only used when autoCommit=false.

          When autoCommit=false, the writer will append stored fields /
          vectors to a single set of files even though it is flushing normal
          segments whenever RAM is full. These normal segments all refer to
          the single shared set of "doc store" files. Then when segments
          are merged, the newly merged segment has its own "private" doc
          stores again. So the sharing only occurs for the "level 0"
          segments.

          I still need to update fileformats doc with this change.
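
          A small sketch of how a shared doc store gets addressed, following
          the description above: the .fdx index holds one 8-byte pointer per
          document, so a segment that shares the store simply adds its
          docStoreOffset before seeking. The surrounding class is hypothetical;
          FieldsReader contains the real logic.

          import java.io.IOException;
          import org.apache.lucene.store.IndexInput;

          class SharedStoreExample {
            private final IndexInput indexStream;   // the shared .fdx file
            private final IndexInput fieldsStream;  // the shared .fdt file
            private final int docStoreOffset;       // this segment's first doc in the shared store

            SharedStoreExample(IndexInput fdx, IndexInput fdt, int docStoreOffset) {
              this.indexStream = fdx;
              this.fieldsStream = fdt;
              this.docStoreOffset = docStoreOffset;
            }

            /** Position fieldsStream at the stored fields of segment-local docID. */
            void seekToDoc(int docID) throws IOException {
              indexStream.seek((long) (docStoreOffset + docID) * 8L);  // 8 bytes per doc in .fdx
              fieldsStream.seek(indexStream.readLong());
            }
          }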

          Michael McCandless made changes -
          Attachment LUCENE-843.take8.patch [ 12359906 ]
          Yonik Seeley added a comment -

          > When merging segments we don't merge the "doc store" files when all segments are sharing the same ones (big performance gain),

          Is this only in the case where the segments have no deleted docs?

          Michael McCandless added a comment -

          > > When merging segments we don't merge the "doc store" files when all segments are sharing the same ones (big performance gain),
          >
          > Is this only in the case where the segments have no deleted docs?

          Right. Also the segments must be contiguous which the current merge
          policy ensures but future merge policies may not.

          Michael McCandless added a comment -

          OK, I attached a new version (take9) of the patch that reverts back to
          the default of "flush after every 10 documents added" in IndexWriter.
          This removes the dependency on LUCENE-845.

          However, I still think we should later (once LUCENE-845 is done)
          default IndexWriter to flush by RAM usage since this will generally
          give the best "out of the box" performance. I will open a separate
          issue to change the default after this issue is resolved.
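
          As a usage sketch, the two flush triggers look like this from the
          application side; the method names below are the ones that eventually
          shipped in 2.3 (setMaxBufferedDocs / setRAMBufferSizeMB), while
          earlier iterations of this patch used slightly different names.

          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.store.FSDirectory;

          public class FlushConfigExample {
            public static void main(String[] args) throws Exception {
              IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/tmp/index"),
                                                   new StandardAnalyzer(), true);

              // Default kept by this patch: flush every N buffered documents.
              writer.setMaxBufferedDocs(10);

              // Alternative discussed above: flush whenever buffered postings exceed a
              // RAM budget (generally better out-of-the-box performance once LUCENE-845
              // is done).
              // writer.setRAMBufferSizeMB(32.0);

              // ... addDocument() calls ...
              writer.close();
            }
          }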

          Michael McCandless made changes -
          Attachment LUCENE-843.take9.patch [ 12360022 ]
          Steven Parkes added a comment -

          I've started looking at this, what it would take to merge with the merge policy stuff (LUCENE-847). Noticed that there are a couple of test failures?

          Michael McCandless added a comment -

          Oh, were the test failures only in the TestBackwardsCompatibility?

          Because I changed the index file format, I added 2 more ZIP files to
          that unit test, but, "svn diff" doesn't pick up the new zip files. So
          I'm attaching them. Can you pull off these zip files into your
          src/test/org/apache/lucene/index and test again? Thanks.

          Michael McCandless made changes -
          Attachment index.presharedstores.cfs.zip [ 12360213 ]
          Attachment index.presharedstores.nocfs.zip [ 12360214 ]
          Steven Parkes added a comment -

          Yeah, that was it.

          I'll be delving more into the code as I try to figure out how it will dovetail with the merge policy factoring.

          Michael McCandless added a comment -

          > Yeah, that was it.

          Phew!

          > I'll be delving more into the code as I try to figure out how it will
          > dovetail with the merge policy factoring.

          OK, thanks. I am very eager to get some other eyeballs looking for
          issues with this patch!

          I think this patch and the merge policy refactoring should be fairly
          separate.

          With this patch, "flushing" RAM -> Lucene segment is no longer a
          "mergeSegments" call which I think simplifies IndexWriter. Previously
          mergeSegments had lots of extra logic to tell if it was merging RAM
          segments (= a flush) vs merging "real" segments but now it's simpler
          because mergeSegments really only merges segments.

          Michael Busch added a comment -

          Hi Mike,

          my first comment on this patch is: Impressive!

          It's also quite overwhelming at the beginning, but I'm trying to dig into it. I'll probably have more questions, here's the first one:

          Does DocumentsWriter also solve the problem DocumentWriter had before LUCENE-880? I believe the answer is yes. Even though you close the TokenStreams in the finally clause of invertField() like DocumentWriter did before 880 this is safe, because addPosition() serializes the term strings and payload bytes into the posting hash table right away. Is that right?

          Michael Busch added a comment -

          Mike,

          the benchmarks you run focus on measuring the pure indexing performance. I think it would be interesting to know how big the speedup is in real-life scenarios, i. e. with StandardAnalyzer and maybe even HTML parsing? For sure the speedup will be less, but it should still be a significant improvement. Did you run those kinds of benchmarks already?

          Michael McCandless added a comment -

          > Does DocumentsWriter also solve the problem DocumentWriter had
          > before LUCENE-880? I believe the answer is yes. Even though you
          > close the TokenStreams in the finally clause of invertField() like
          > DocumentWriter did before 880 this is safe, because addPosition()
          > serializes the term strings and payload bytes into the posting hash
          > table right away. Is that right?

          That's right. When I merged in the fix for LUCENE-880, I realized
          with this patch it's fine to close the token stream immediately after
          processing all of its tokens because everything about the token stream
          has been "absorbed" into postings hash.

          > the benchmarks you run focus on measuring the pure indexing
          > performance. I think it would be interesting to know how big the
          > speedup is in real-life scenarios, i. e. with StandardAnalyzer and
          > maybe even HTML parsing? For sure the speedup will be less, but it
          > should still be a significant improvement. Did you run those kinds
          > of benchmarks already?

          Good question ... I haven't measured the performance cost of using
          StandardAnalyzer or HTML parsing but I will test & post back.

          Michael McCandless added a comment -

          OK I ran tests comparing analyzer performance.

          It's the same test framework as above, using the ~5,500 byte Europarl
          docs with autoCommit=true, 32 MB RAM buffer, no stored fields nor
          vectors, and CFS=false, indexing 200,000 documents.

          The SimpleSpaceAnalyzer is my own whitespace analyzer that minimizes
          GC cost by not allocating a Term or String for every token in every
          document.

          Each run is best time of 2 runs:

          ANALYZER                PATCH (sec)   TRUNK (sec)   SPEEDUP
          SimpleSpaceAnalyzer            79.0         326.5     4.1 X
          StandardAnalyzer              449.0         674.1     1.5 X
          WhitespaceAnalyzer            104.0         338.9     3.3 X
          SimpleAnalyzer                104.7         328.0     3.1 X

          StandardAnalyzer is definitely rather time consuming!

          Michael Busch added a comment -

          > OK I ran tests comparing analyzer performance.

          Thanks for the numbers Mike. Yes the gain is less with StandardAnalyzer
          but 1.5X faster is still very good!

          I have a question about the extensibility of your code. For flexible
          indexing we want to be able in the future to implement different posting
          formats and we might even want to allow our users to implement own
          posting formats.

          When I implemented multi-level skipping I tried to keep this in mind.
          Therefore I put most of the functionality in the two abstract classes
          MultiLevelSkipListReader/Writer. Subclasses implement the actual format
          of the skip data. I think with this design it should be quite easy to
          implement different formats in the future while limiting the code
          complexity.
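
          A rough sketch of that separation (the signatures here are assumptions,
          not the committed API):

          // Sketch only: shared infrastructure in the abstract class, the actual
          // skip-data format supplied by a subclass.
          abstract class MultiLevelSkipListWriter {
            void bufferSkip(int df) throws IOException { /* level math, buffering */ }
            long writeSkip(IndexOutput output) throws IOException { /* write levels */ return 0; }

            // format-specific part, implemented by each subclass:
            protected abstract void writeSkipData(int level, IndexOutput skipBuffer)
                throws IOException;
          }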

          With the old DocumentWriter I think this is quite simple to do too by
          adding a class like PostingListWriter, where subclasses define the actual
          format (because DocumentWriter is so simple).

          Do you think your code is easily extensible in this regard? I'm
          wondering because of all the optimizations you're doing, such as
          sharing byte arrays. But I'm certainly not familiar enough with your code
          yet, so I'm only guessing here.

          Hide
          Michael McCandless added a comment -

          > Do you think your code is easily extensible in this regard? I'm
          > wondering because of all the optimizations you're doing like e. g.
          > sharing byte arrays. But I'm certainly not familiar enough with your code
          > yet, so I'm only guessing here.

          Good question!

          DocumentsWriter is definitely more complex than DocumentWriter, but it
          doesn't prevent extensibility and I think will work very well when we
          do flexible indexing.

          The patch now has dedicated methods for writing into the freq/prox/etc
          streams ('writeFreqByte', 'writeFreqVInt', 'writeProxByte',
          'writeProxVInt', etc.), but this could easily be changed to instead
          use true IndexOutput streams. This would then hide all details of
          shared byte arrays from whoever is doing the writing.
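          For example, the byte-slice writes could sit behind a small stream facade
          along these lines (purely a sketch; the class name and internals are
          assumptions, not the patch's actual classes):

          import java.util.ArrayList;
          import java.util.List;

          // Sketch only: a tiny stream over recycled byte blocks, so the code
          // writing the postings format never touches the shared arrays directly.
          class ByteSliceOutput {
            private final List<byte[]> blocks = new ArrayList<byte[]>();
            private byte[] current = new byte[1024];
            private int upto = 0;

            ByteSliceOutput() { blocks.add(current); }

            void writeByte(byte b) {
              if (upto == current.length) {   // current block full: in the real patch
                current = new byte[1024];     // this would reuse a recycled block
                blocks.add(current);
                upto = 0;
              }
              current[upto++] = b;
            }

            void writeVInt(int i) {           // same VInt encoding IndexOutput uses
              while ((i & ~0x7F) != 0) {
                writeByte((byte) ((i & 0x7F) | 0x80));
                i >>>= 7;
              }
              writeByte((byte) i);
            }
          }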

          The way I roughly see flexible indexing working in the future is
          DocumentsWriter will be responsible for keeping track of unique terms
          seen (in its hash table), holding the Posting instance (which could be
          subclassed in the future) for each term, flushing a real segment when
          full, handling shared byte arrays, etc. Ie all the "infrastructure".

          But then the specific logic of what bytes are written into which
          streams (freq/prox/vectors/others) will be handled by a separate class
          or classes that we can plug/unplug according to some "schema".
          DocumentsWriter would call on these classes and provide the
          IndexOutputs for all streams for the Posting, per position, and these
          classes write their own format into the IndexOutputs.

          I think a separation like that would work well: we could have good
          performance and also extensibility. Devil is in the details of
          course...
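
          One hypothetical shape for that plug-in point (the names are invented
          here for illustration only; nothing like this exists in the patch):

          import java.io.IOException;
          import org.apache.lucene.store.IndexOutput;

          // Hypothetical pluggable posting format, for illustration only.
          interface PostingFormatWriter {
            // called for each position of a term; DocumentsWriter owns the streams
            void addPosition(int docID, int position, byte[] payload,
                             IndexOutput freq, IndexOutput prox) throws IOException;

            // called when the RAM buffer is flushed into a real segment
            void finish(IndexOutput freq, IndexOutput prox) throws IOException;
          }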

          I obviously haven't factored DocumentsWriter in this way (it has its
          own addPosition that writes the current Lucene index format) but I
          think this is very doable in the future.

          Hide
          Doron Cohen added a comment -

          Mike, I am considering testing the performance of this patch on a somewhat
          different use case - a real one, I think. After indexing 25M docs of TREC
          .gov2 (~500GB of docs) I pushed the index terms into a spell-correction
          index using the contrib spell checker. Docs here are very short: for each
          index term a document is created, containing some N-GRAMS. On the specific
          machine I used there are 2 CPUs, but the SpellChecker indexing does not
          take advantage of that. Anyhow, 126,684,685 words==documents were indexed.
          For the docs adding step I had:
          mergeFactor = 100,000
          maxBufferedDocs = 10,000
          So no merging took place.
          This step took 21 hours, and created 12,685 segments, total size 15 - 20 GB.
          Then I optimized the index with
          mergeFactor = 400
          (Larger values were hard on the open-file limit.)

          I thought it would be interesting to see how the new code performs in this scenario, what do you think?

          If you too find this comparison interesting, I have two more questions:

          • what settings do you recommend?
          • is there any chance for speed-up in optimize()? I didn't read your
            new code yet, but at least from some comments here it seems that
            on disk merging was not changed... is this (still) so? I would skip the
            optimize part if this is not of interest for the comparison. (In fact I am
            still waiting for my optimize() to complete, but if it is not of interest I
            will just interrupt it...)

          Thanks,
          Doron

          Hide
          Michael McCandless added a comment -

          > I thought it would be interesting to see how the new code performs in this scenario, what do you think?

          Yes I'd be very interested to see the results of this. It's a
          somewhat "unusual" indexing situation (such tiny docs) but it's a real
          world test case. Thanks!

          > - what settings do you recommend?

          I think these are likely the important ones in this case:

          • Flush by RAM instead of doc count
            (writer.setRAMBufferSizeMB(...)).
          • Give it as much RAM as you can.
          • Use maybe 3 indexing threads (if you can).
          • Turn off compound file.
          • If you have stored fields/vectors (seems not in this case) use
            autoCommit=false.
          • Use a trivial analyzer that doesn't create new String/new Token
            (re-use the same Token, and use the char[] based term text
            storage instead of the String one).
          • Re-use Document/Field instances. The DocumentsWriter is fine with
            this and it saves substantial time from GC especially because your
            docs are so tiny (per-doc overhead is otherwise a killer). In
            IndexLineFiles I made a StringReader that lets me reset its String
            value; this way I didn't have to change the Field instances stored
            in the Document.
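
          Put together, the writer-side settings above might look roughly like this
          (a sketch assuming the 2.3-era IndexWriter API; adjust to whatever your
          build exposes):

          // Rough sketch of the recommended writer settings (2.3-era API assumed).
          IndexWriter writer = new IndexWriter(dir, false /* autoCommit */, analyzer, true /* create */);
          writer.setRAMBufferSizeMB(512.0);   // flush by RAM usage, as large as you can afford
          writer.setUseCompoundFile(false);   // skip the compound-file packing step
          // ... feed documents from ~3 indexing threads sharing this one writer ...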

          > - is there any chance for speed-up in optimize()? I didn't read
          > your new code yet, but at least from some comments here it seems
          > that on disk merging was not changed... is this (still) so? I would

          Correct: my patch doesn't touch merging and optimizing. All it does
          now is gather many docs in RAM and then flush a new segment when it's
          time. I've opened a separate issue (LUCENE-856) for optimizations
          in segment merging.

          Hide
          Doron Cohen added a comment -

          Just to clarify your comment on reusing field and doc instances - to my understanding reusing a field instance is ok only after the containing doc was added to the index.

          For a "fair" comparison I ended up not following most of your recommendations, including the reuse field/docs one and the non-compound one (apologies), but I might use them later.

          For the first 100,000,000 docs (==speller words) the speed-up is quite amazing:
          Orig: Speller: added 100000000 words in 58490 seconds = 16 hours 14 minutes 50 seconds
          New: Speller: added 100000000 words in 10912 seconds = 3 hours 1 minute 52 seconds
          This is 5.3 times faster !!!
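          (For reference: 58490 s / 10912 s is about 5.36, which is where the 5.3X figure comes from.)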

          This btw was with maxBufDocs=100,000 (I forgot to set the MEM param).
          I stopped the run now, I don't expect to learn anything new by letting it continue.

          When trying with MEM=512MB it at first seemed faster, but then there were occasional local slow-downs, and eventually it became a bit slower than the previous run. I know these are not merges, so they are either flushes (RAM directed) or GC activity. I will perhaps run with GC debug flags and add a print at flush to identify the culprit for these local slow-downs.

          Other than that, I will perhaps try to index .GOV2 (25 Million HTML docs) with this patch. The way I indexed it before, it took about 4 days - running in 4 threads and creating 36 indexes. This is even more of a real-life scenario: it involves HTML parsing, standard analysis, and merging (to some extent). Since there are 4 threads, each one will get, say, 250MB. Again, for a "fair" comparison, I will remain with compound.

          Hide
          Michael McCandless added a comment -

          > Just to clarify your comment on reusing field and doc instances - to my
          > understanding reusing a field instance is ok only after the containing
          > doc was added to the index.

          Right, if your documents are very "regular" you should get a sizable
          speedup (especially for tiny docs), with or without this patch, if you
          make a single Document and add separate Field instances to it for
          each field, and then reuse the Document and Field instances for all
          the docs you want to add.

          It's not easy to reuse Field instances now (there's no
          setStringValue()). I made a ReusableStringReader to do this but you
          could also make your own class that implements Fieldable.
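
          For instance, a resettable reader along these lines lets one Field
          instance be fed new text per document (a sketch only; this is not the
          ReusableStringReader mentioned above):

          // Sketch of a resettable Reader so a single Field instance can be reused.
          class ResettableStringReader extends java.io.Reader {
            private String s = "";
            private int upto = 0;

            void setValue(String value) { this.s = value; this.upto = 0; }

            public int read(char[] buf, int off, int len) {
              if (upto == s.length()) return -1;
              int n = Math.min(len, s.length() - upto);
              s.getChars(upto, upto + n, buf, off);
              upto += n;
              return n;
            }

            public void close() {}
          }

          // Usage (per document): reader.setValue(nextText); writer.addDocument(doc);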

          > For a "fair" comparison I ended up not following most of your
          > recommendations, including the reuse field/docs one and the non-compound
          > one (apologies), but I might use them later.

          OK, when you say "fair" I think you mean that because you already had a
          previous run that used the compound file, you had to use the compound file
          in the run with the LUCENE-843 patch (etc.)? The recommendations above
          should speed up Lucene with or without my patch.

          > For the first 100,000,000 docs (==speller words) the speed-up is quite
          > amazing:
          > Orig: Speller: added 100000000 words in 10912 seconds = 3 hours 1
          > minutes 52 seconds
          > New: Speller: added 100000000 words in 58490 seconds = 16 hours 14
          > minutes 50 seconds
          > This is 5.3 times faster !!!

          Wow! I think the speedup might be even greater if both of your runs followed
          the suggestions above.

          > This btw was with maxBufDocs=100,000 (I forgot to set the MEM param).
          > I stopped the run now, I don't expect to learn anything new by letting it
          > continue.
          >
          > When trying with MEM=512MB, it at first seemed faster, but then there
          > were now and then local slow-downs, and eventually it became a bit slower
          > than the previous run. I know these are not merges, so they are either
          > flushes (RAM directed), or GC activity. I will perhaps run with GC debug
          > flags and perhaps add a print at flush so to tell the culprit for these
          > local slow-downs.

          Hurm, odd. I haven't pushed the RAM buffer up to 512 MB, so it could be
          that GC cost somehow makes things worse ... curious.

          > Other than that, I will perhaps try to index .GOV2 (25 Million HTML docs)
          > with this patch. The way I indexed it before it took about 4 days -
          > running in 4 threads, and creating 36 indexes. This is even more a real
          > life scenario, it involves HTML parsing, standard analysis, and merging
          > (to some extent). Since there are 4 threads each one will get, say,
          > 250MB. Again, for a "fair" comparison, I will remain with compound.

          OK, because you're doing StandardAnalyzer and HTML parsing and
          presumably loading one-doc-per-file, most of your time is spent
          outside of Lucene indexing, so I'd expect less than a 50% speedup in
          this case.

          Michael McCandless committed 553235 (1 file)
          Reviews: none

          LUCENE-843: planting tag just before committing

          Michael McCandless committed 553236 (36 files)
          Reviews: none

          LUCENE-843: speed up IndexWriter performance

          Lucene trunk
          Michael McCandless made changes -
          Status In Progress [ 3 ] Resolved [ 5 ]
          Fix Version/s 2.3 [ 12312531 ]
          Resolution Fixed [ 1 ]
          Lucene Fields [New] [New, Patch Available]
          Michael McCandless committed 553268 (1 file)
          Reviews: none

          LUCENE-843: woops ... move the CHANGES.txt entries into the right section (trunk)

          Michael McCandless committed 553298 (3 files)
          Reviews: none

          LUCENE-843: SegmentInfo.clone() failed to copy some fields; in certain cases _x.nrm might not be created even when fieldInfos says it has norms

          Hide
          Michael McCandless added a comment -

          Re-opening this issue: I saw one failure of the contrib/benchmark
          TestPerfTasksLogic.testParallelDocMaker() tests due to an intermittent
          thread-safety issue. It's hard to get the failure to happen (it's
          happened only once in ~20 runs of contrib/benchmark) but I see where
          the issue is. Will commit a fix shortly.

          Michael McCandless made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Lucene Fields [Patch Available, New] [New, Patch Available]
          Michael McCandless committed 553854 (1 file)
          Reviews: none

          LUCENE-843: add missing 'synchronized' to allThreadsIdle() method

          Hide
          Steven Parkes added a comment -

          Did we lose the triggered merge stuff from 887, i.e., should it be

          if (triggerMerge) {
            /* new merge policy
            if (0 == docWriter.getMaxBufferedDocs())
              maybeMergeSegments(mergeFactor * numDocs / 2);
            else
              maybeMergeSegments(docWriter.getMaxBufferedDocs());
            */
            maybeMergeSegments(docWriter.getMaxBufferedDocs());
          }
          Hide
          Michael McCandless added a comment -

          Woops ... you are right; thanks for catching it! I will add a unit
          test & fix it. I will also make the flush(boolean triggerMerge,
          boolean flushDocStores) protected, not public, and move the javadoc
          back to the public flush().

          Michael McCandless committed 555793 (2 files)
          Michael McCandless made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Lucene Fields [Patch Available, New] [New, Patch Available]
          Resolution Fixed [ 1 ]
          Grant Ingersoll made changes -
          Link This issue is related to SOLR-342 [ SOLR-342 ]
          Michael McCandless committed 593114 (1 file)
          Reviews: none

          LUCENE-843: change public -> package protected for two advanced static methods on SegmentReader used only during merging

          Michael Busch made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Mark Thomas made changes -
          Workflow jira [ 12400220 ] Default workflow, editable Closed status [ 12564574 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12564574 ] jira [ 12584981 ]

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              5
              Watchers:
              4
