Issue Details (XML | Word | Printable)

Key: LUCENE-843
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Michael McCandless
Reporter: Michael McCandless
Votes: 5
Watchers: 4
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

improve how IndexWriter uses RAM to buffer added documents

Created: 22/Mar/07 05:05 PM   Updated: 25/Jan/08 03:23 AM
Return to search
Component/s: Index
Affects Version/s: 2.2
Fix Version/s: 2.3

Time Tracking:
Not Specified

File Attachments:
  Size
Zip Archive Licensed for inclusion in ASF works index.presharedstores.cfs.zip 2007-06-20 03:58 PM Michael McCandless 2 kB
Zip Archive Licensed for inclusion in ASF works index.presharedstores.nocfs.zip 2007-06-20 03:58 PM Michael McCandless 5 kB
Text File Licensed for inclusion in ASF works LUCENE-843.patch 2007-03-22 05:06 PM Michael McCandless 141 kB
Text File Licensed for inclusion in ASF works LUCENE-843.take2.patch 2007-03-25 02:30 PM Michael McCandless 148 kB
Text File Licensed for inclusion in ASF works LUCENE-843.take3.patch 2007-03-28 12:49 PM Michael McCandless 156 kB
Text File Licensed for inclusion in ASF works LUCENE-843.take4.patch 2007-04-02 02:43 PM Michael McCandless 188 kB
Text File Licensed for inclusion in ASF works LUCENE-843.take5.patch 2007-04-30 10:39 AM Michael McCandless 239 kB
Text File Licensed for inclusion in ASF works LUCENE-843.take6.patch 2007-05-21 06:14 PM Michael McCandless 210 kB
Text File Licensed for inclusion in ASF works LUCENE-843.take7.patch 2007-06-08 01:31 PM Michael McCandless 189 kB
Text File Licensed for inclusion in ASF works LUCENE-843.take8.patch 2007-06-15 07:00 PM Michael McCandless 203 kB
Text File Licensed for inclusion in ASF works LUCENE-843.take9.patch 2007-06-18 01:56 PM Michael McCandless 204 kB
Issue Links:
Blocker
 
Reference
 

Lucene Fields: New, Patch Available
Resolution Date: 12/Aug/07 03:48 PM


 Description  « Hide
I'm working on a new class (MultiDocumentWriter) that writes more than
one document directly into a single Lucene segment, more efficiently
than the current approach.

This only affects the creation of an initial segment from added
documents. I haven't changed anything after that, eg how segments are
merged.

The basic ideas are:

  • Write stored fields and term vectors directly to disk (don't
    use up RAM for these).
  • Gather posting lists & term infos in RAM, but periodically do
    in-RAM merges. Once RAM is full, flush buffers to disk (and
    merge them later when it's time to make a real segment).
  • Recycle objects/buffers to reduce time/stress in GC.
  • Other various optimizations.

Some of these changes are similar to how KinoSearch builds a segment.
But, I haven't made any changes to Lucene's file format nor added
requirements for a global fields schema.

So far the only externally visible change is a new method
"setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
deprecated) so that it flushes according to RAM usage and not a fixed
number documents added.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Michael McCandless added a comment - 22/Mar/07 05:06 PM
I'm attaching a patch with my current state. NOTE: this is very rough
and very much a work in progress and nowhere near ready to commit! I
wanted to get it out there sooner rather than later to get feedback,
maybe entice some daring early adopters, iterate, etc.

It passes all unit tests except the disk-full tests.

There are some big issues yet to resolve:

  • Merge policy has problems when you "flush by RAM" (this is true
    even before my patch). Not sure how to fix yet.
  • Thread safety and thread concurrency aren't there yet.
  • Norms are not flushed (just use up RAM until you close the
    writer).
  • Many other things on my TODO list

Michael McCandless added a comment - 25/Mar/07 02:30 PM
New rev of the patch:
  • Fixed at least one data corruption case
  • Added more asserts (run with "java -ea" so asserts run)
  • Some more small optimizations
  • Updated to current trunk so patch applies cleanly

Michael McCandless added a comment - 28/Mar/07 12:49 PM

Another rev of the patch:

  • Got thread concurrency working: removed "synchronized" from entire
    call to MultiDocWriter.addDocument and instead synchronize two
    quick steps (init/finish) addDocument leaving the real work
    (processDocument) unsynchronized.
  • Fixed bug that was failing to delete temp files from index
  • Reduced memory usage of Posting by inlining positions, start
    offset, end offset into a single int array.
  • Enabled IndexLineFiles.java (tool I use for local benchmarking) to
    run multiple threads
  • Other small optimizations

BTW, one of the nice side effects of this patch is it cleans up the
mergeSegments method of IndexWriter by separating out "flush" of added
docs & deletions because it's no longer a merge, from the "true"
mergeSegments whose purpose is then to merge disk segments.
Previously mergeSegments was getting rather confusing with the
different cases/combinations of added docs or not, deleted docs or
not, any merges or not.


Michael McCandless added a comment - 02/Apr/07 02:43 PM
Another rev of the patch. All tests pass except disk full tests. The
code is still rather "dirty" and not well commented.

I think I'm close to finishing optimizing and now I will focus on
error handling (eg disk full), adding some deeper unit tests, more
testing on corner cases like massive docs or docs with massive terms,
etc., flushing pending norms to disk, cleaning up / commenting the
code and various other smaller items.

Here are the changes in this rev:

  • A proposed backwards compatible change to the Token API to also
    allow the term text to be delivered as a slice (offset & length)
    into a char[] array instead of String. With an analyzer/tokenizer
    that takes advantage of this, this was a decent performance gain
    in my local testing. I've created a SimpleSpaceAnalyzer that only
    splits words at the space character to test this.
  • Added more asserts (run java -ea to enable asserts). The asserts
    are quite useful and now often catch a bug I've introduced before
    the unit tests do.
  • Changed to custom int[] block buffering for postings to store
    freq, prox's and offsets. With this buffering we no longer have
    to double the size of int[] arrays while adding positions, nor do
    we have to copy ints whenever we needs more space for these
    arrays. Instead I allocate larger slices out of the shared int[]
    arrays. This reduces memory and improves performance.
  • Changed to custom char[] block buffering for postings to store
    term text. This also reduces memory and improves performance.
  • Changed to single file for RAM & flushed partial segments (was 3
    separate files before)
  • Changed how I merge flushed partial segments to match what's
    described in LUCENE-854
  • Reduced memory usage when indexing large docs (25 MB plain text
    each). I'm still consuming more RAM in this case than the
    baseline (trunk) so I'm still working on this one ...
  • Fixed a slow memory leak when building large (20+ GB) indices

Michael McCandless added a comment - 03/Apr/07 10:20 AM
Some details on how I measure RAM usage: both the baseline (current
lucene trunk) and my patch have two general classes of RAM usage.

The first class, "document processing RAM", is RAM used while
processing a single doc. This RAM is re-used for each document (in the
trunk, it's GC'd and new RAM is allocated; in my patch, I explicitly
re-use these objects) and how large it gets is driven by how big each
document is.

The second class, "indexed documents RAM", is the RAM used up by
previously indexed documents. This RAM grows with each added
document and how large it gets is driven by the number and size of
docs indexed since the last flush.

So when I say the writer is allowed to use 32 MB of RAM, I'm only
measuring the "indexed documents RAM". With trunk I do this by
calling ramSizeInBytes(), and with my patch I do the analagous thing
by measuring how many RAM buffers are held up storing previously
indexed documents.

I then define "RAM efficiency" (docs/MB) as how many docs we can hold
in "indexed documents RAM" per MB RAM, at the point that we flush to
disk. I think this is an important metric because it drives how large
your initial (level 0) segments are. The larger these segments are
then generally the less merging you need to do, for a given # docs in
the index.

I also measure overall RAM used in the JVM (using
MemoryMXBean.getHeapMemoryUsage().getUsed()) just prior to each flush
except the last, to also capture the "document processing RAM", object
overhead, etc.


Michael McCandless added a comment - 03/Apr/07 10:20 AM
To do the benchmarking I created a simple standalone tool
(demo/IndexLineFiles, in the last patch) that indexes one line at a
time from a large previously created file, optionally using multiple
threads. I do it this way to minimize IO cost of pulling the document
source because I want to measure just indexing time as much as possible.

Each line is read and a doc is created with field "contents" that is
not stored, is tokenized, and optionally has term vectors with
position+offsets. I also optionally add two small only-stored fields
("path" and "modified"). I think these are fairly trivial documents
compared to typical usage of Lucene.

For the corpus, I took Europarl's "en" content, stripped tags, and
processed into 3 files: one with 100 tokens per line (= ~550 bytes),
one with 1000 tokens per line (= ~5,500 bytes) and with 10000 tokens
per line (= ~55,000 bytes) plain text per line.

All settings (mergeFactor, compound file, etc.) are left at defaults.
I don't optimize the index in the end. I'm using my new
SimpleSpaceAnalyzer (just splits token on the space character and
creates token text as slice into a char[] array instead of new
String(...)) to minimize the cost of tokenization.

I ran the tests with Java 1.5 on a Mac Pro quad (2 Intel CPUs, each
dual core) OS X box with 2 GB RAM. I give java 1 GB heap (-Xmx1024m).


Michael McCandless added a comment - 03/Apr/07 12:11 PM
A couple more details on the testing: I run java -server to get all
optimizations in the JVM, and the IO system is a local OS X RAID 0 of
4 SATA drives.

Using the above tool I ran an initial set of benchmarks comparing old
(= Lucene trunk) vs new (= this patch), varying document size (~550
bytes to ~5,500 bytes to ~55,000 bytes of plain text from Europarl
"en").

For each document size I run 4 combinations of whether term vectors
and stored fields are on or off and whether autoCommit is true or
false. I measure net docs/sec (= total # docs indexed divided by
total time taken), RAM efficiency (= avg # docs flushed with each
flush divided by RAM buffer size), and avg HEAP RAM usage before each
flush.

Here are the results for the 10K tokens (= ~55,000 bytes plain text)
per document:

20000 DOCS @ ~55,000 bytes plain text
RAM = 32 MB
NUM THREADS = 1
MERGE FACTOR = 10

No term vectors nor stored fields

AUTOCOMMIT = true (commit whenever RAM is full)

old
20000 docs in 200.3 secs
index size = 358M

new
20000 docs in 126.0 secs
index size = 356M

Total Docs/sec: old 99.8; new 158.7 [ 59.0% faster]
Docs/MB @ flush: old 24.2; new 49.1 [ 102.5% more]
Avg RAM used (MB) @ flush: old 74.5; new 36.2 [ 51.4% less]

AUTOCOMMIT = false (commit only once at the end)

old
20000 docs in 202.7 secs
index size = 358M

new
20000 docs in 120.0 secs
index size = 354M

Total Docs/sec: old 98.7; new 166.7 [ 69.0% faster]
Docs/MB @ flush: old 24.2; new 48.9 [ 101.7% more]
Avg RAM used (MB) @ flush: old 74.3; new 37.0 [ 50.2% less]

With term vectors (positions + offsets) and 2 small stored fields

AUTOCOMMIT = true (commit whenever RAM is full)

old
20000 docs in 374.7 secs
index size = 1.4G

new
20000 docs in 236.1 secs
index size = 1.4G

Total Docs/sec: old 53.4; new 84.7 [ 58.7% faster]
Docs/MB @ flush: old 10.2; new 49.1 [ 382.8% more]
Avg RAM used (MB) @ flush: old 129.3; new 36.6 [ 71.7% less]

AUTOCOMMIT = false (commit only once at the end)

old
20000 docs in 385.7 secs
index size = 1.4G

new
20000 docs in 182.8 secs
index size = 1.4G

Total Docs/sec: old 51.9; new 109.4 [ 111.0% faster]
Docs/MB @ flush: old 10.2; new 48.9 [ 380.9% more]
Avg RAM used (MB) @ flush: old 76.0; new 37.3 [ 50.9% less]


Michael McCandless added a comment - 03/Apr/07 12:15 PM
Here are the results for "normal" sized docs (1K tokens = ~5,500 bytes plain text each):

200000 DOCS @ ~5,500 bytes plain text
RAM = 32 MB
NUM THREADS = 1
MERGE FACTOR = 10

No term vectors nor stored fields

AUTOCOMMIT = true (commit whenever RAM is full)

old
200000 docs in 397.6 secs
index size = 415M

new
200000 docs in 167.5 secs
index size = 411M

Total Docs/sec: old 503.1; new 1194.1 [ 137.3% faster]
Docs/MB @ flush: old 81.6; new 406.2 [ 397.6% more]
Avg RAM used (MB) @ flush: old 87.3; new 35.2 [ 59.7% less]

AUTOCOMMIT = false (commit only once at the end)

old
200000 docs in 394.6 secs
index size = 415M

new
200000 docs in 168.4 secs
index size = 408M

Total Docs/sec: old 506.9; new 1187.7 [ 134.3% faster]
Docs/MB @ flush: old 81.6; new 432.2 [ 429.4% more]
Avg RAM used (MB) @ flush: old 126.6; new 36.9 [ 70.8% less]

With term vectors (positions + offsets) and 2 small stored fields

AUTOCOMMIT = true (commit whenever RAM is full)

old
200000 docs in 754.2 secs
index size = 1.7G

new
200000 docs in 304.9 secs
index size = 1.7G

Total Docs/sec: old 265.2; new 656.0 [ 147.4% faster]
Docs/MB @ flush: old 46.7; new 406.2 [ 769.6% more]
Avg RAM used (MB) @ flush: old 92.9; new 35.2 [ 62.1% less]

AUTOCOMMIT = false (commit only once at the end)

old
200000 docs in 743.9 secs
index size = 1.7G

new
200000 docs in 244.3 secs
index size = 1.7G

Total Docs/sec: old 268.9; new 818.7 [ 204.5% faster]
Docs/MB @ flush: old 46.7; new 432.2 [ 825.2% more]
Avg RAM used (MB) @ flush: old 93.0; new 36.6 [ 60.6% less]


Michael McCandless added a comment - 03/Apr/07 12:16 PM

Last is the results for small docs (100 tokens = ~550 bytes plain text each):

2000000 DOCS @ ~550 bytes plain text
RAM = 32 MB
NUM THREADS = 1
MERGE FACTOR = 10

No term vectors nor stored fields

AUTOCOMMIT = true (commit whenever RAM is full)

old
2000000 docs in 886.7 secs
index size = 438M

new
2000000 docs in 230.5 secs
index size = 435M

Total Docs/sec: old 2255.6; new 8676.4 [ 284.7% faster]
Docs/MB @ flush: old 128.0; new 4194.6 [ 3176.2% more]
Avg RAM used (MB) @ flush: old 107.3; new 37.7 [ 64.9% less]

AUTOCOMMIT = false (commit only once at the end)

old
2000000 docs in 888.7 secs
index size = 438M

new
2000000 docs in 239.6 secs
index size = 432M

Total Docs/sec: old 2250.5; new 8348.7 [ 271.0% faster]
Docs/MB @ flush: old 128.0; new 4146.8 [ 3138.9% more]
Avg RAM used (MB) @ flush: old 108.1; new 38.9 [ 64.0% less]

With term vectors (positions + offsets) and 2 small stored fields

AUTOCOMMIT = true (commit whenever RAM is full)

old
2000000 docs in 1480.1 secs
index size = 2.1G

new
2000000 docs in 462.0 secs
index size = 2.1G

Total Docs/sec: old 1351.2; new 4329.3 [ 220.4% faster]
Docs/MB @ flush: old 93.1; new 4194.6 [ 4405.7% more]
Avg RAM used (MB) @ flush: old 296.4; new 38.3 [ 87.1% less]

AUTOCOMMIT = false (commit only once at the end)

old
2000000 docs in 1489.4 secs
index size = 2.1G

new
2000000 docs in 347.9 secs
index size = 2.1G

Total Docs/sec: old 1342.8; new 5749.4 [ 328.2% faster]
Docs/MB @ flush: old 93.1; new 4146.8 [ 4354.5% more]
Avg RAM used (MB) @ flush: old 297.1; new 38.6 [ 87.0% less]

200000 DOCS @ ~5,500 bytes plain text

No term vectors nor stored fields

AUTOCOMMIT = true (commit whenever RAM is full)

old
200000 docs in 397.6 secs
index size = 415M

new
200000 docs in 167.5 secs
index size = 411M

Total Docs/sec: old 503.1; new 1194.1 [ 137.3% faster]
Docs/MB @ flush: old 81.6; new 406.2 [ 397.6% more]
Avg RAM used (MB) @ flush: old 87.3; new 35.2 [ 59.7% less]

AUTOCOMMIT = false (commit only once at the end)

old
200000 docs in 394.6 secs
index size = 415M

new
200000 docs in 168.4 secs
index size = 408M

Total Docs/sec: old 506.9; new 1187.7 [ 134.3% faster]
Docs/MB @ flush: old 81.6; new 432.2 [ 429.4% more]
Avg RAM used (MB) @ flush: old 126.6; new 36.9 [ 70.8% less]

With term vectors (positions + offsets) and 2 small stored fields

AUTOCOMMIT = true (commit whenever RAM is full)

old
200000 docs in 754.2 secs
index size = 1.7G

new
200000 docs in 304.9 secs
index size = 1.7G

Total Docs/sec: old 265.2; new 656.0 [ 147.4% faster]
Docs/MB @ flush: old 46.7; new 406.2 [ 769.6% more]
Avg RAM used (MB) @ flush: old 92.9; new 35.2 [ 62.1% less]

AUTOCOMMIT = false (commit only once at the end)

old
200000 docs in 743.9 secs
index size = 1.7G

new
200000 docs in 244.3 secs
index size = 1.7G

Total Docs/sec: old 268.9; new 818.7 [ 204.5% faster]
Docs/MB @ flush: old 46.7; new 432.2 [ 825.2% more]
Avg RAM used (MB) @ flush: old 93.0; new 36.6 [ 60.6% less]


Michael McCandless added a comment - 03/Apr/07 12:21 PM
A few notes from these results:
  • A real Lucene app won't see these gains because frequently the
    retrieval of docs from the content source, and the tokenization,
    take substantial amounts of time whereas for this test I've
    intentionally minimized the cost of those steps but they are very
    low for this test because I'm 1) pulling one line at a time from a
    big text file, and 2) using my simplistic SimpleSpaceAnalyzer
    which just breaks tokens at the space character.
  • Best speedup is ~4.3X faster, for tiny docs (~550 bytes) with term
    vectors and stored fields enabled and using autoCommit=false.
  • Least speedup is still ~1.6X faster, for large docs (~55,000
    bytes) with autoCommit=true.
  • The autoCommit=false cases are a little unfair to the new patch
    because with the new patch, you get a single-segment (optimized)
    index in the end, but with existing Lucene trunk, you don't.
  • With term vectors and/or stored fields, autoCommit=false is quite
    a bit faster with the patch, because we never pay the price to
    merge them since they are written once.
  • With term vectors and/or stored fields, the new patch has
    substantially better RAM efficiency.
  • The patch is especially faster and has better RAM efficiency with
    smaller documents.
  • The actual HEAP RAM usage is quite a bit more stable with the
    patch, especially with term vectors & stored fields enabled. I
    think this is because the patch creates far less garbage for GC to
    periodically reclaim. I think this also means you could push your
    RAM buffer size even higher to get better performance.

Marvin Humphrey added a comment - 03/Apr/07 02:23 PM
> The actual HEAP RAM usage is quite a bit more
> stable with the patch, especially with term vectors
> & stored fields enabled. I think this is because the
> patch creates far less garbage for GC to periodically
> reclaim. I think this also means you could push your
> RAM buffer size even higher to get better performance.

For KinoSearch, the sweet spot seems to be a buffer of around 16 MB when benchmarking with the Reuters corpus on my G4 laptop. Larger than that and things actually slow down, unless the buffer is large enough that it never needs flushing. My hypothesis is that RAM fragmentation is slowing down malloc/free. I'll be interested as to whether you see the same effect.


Michael McCandless added a comment - 03/Apr/07 03:33 PM

>> The actual HEAP RAM usage is quite a bit more
>> stable with the patch, especially with term vectors
>> & stored fields enabled. I think this is because the
>> patch creates far less garbage for GC to periodically
>> reclaim. I think this also means you could push your
>> RAM buffer size even higher to get better performance.
>
> For KinoSearch, the sweet spot seems to be a buffer of around 16 MB
> when benchmarking with the Reuters corpus on my G4 laptop. Larger
> than that and things actually slow down, unless the buffer is large
> enough that it never needs flushing. My hypothesis is that RAM
> fragmentation is slowing down malloc/free. I'll be interested as to
> whether you see the same effect.

Interesting. OK I will run the benchmark across increasing RAM sizes
to see where the sweet spot seems to be!


Michael McCandless added a comment - 05/Apr/07 01:21 PM

OK I ran old (trunk) vs new (this patch) with increasing RAM buffer
sizes up to 96 MB.

I used the "normal" sized docs (~5,500 bytes plain text), left stored
fields and term vectors (positions + offsets) on, and
autoCommit=false.

Here're the results:

NUM THREADS = 1
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)

1 MB

old
200000 docs in 862.2 secs
index size = 1.7G

new
200000 docs in 297.1 secs
index size = 1.7G

Total Docs/sec: old 232.0; new 673.2 [ 190.2% faster]
Docs/MB @ flush: old 47.2; new 278.4 [ 489.6% more]
Avg RAM used (MB) @ flush: old 34.5; new 3.4 [ 90.1% less]

2 MB

old
200000 docs in 828.7 secs
index size = 1.7G

new
200000 docs in 279.0 secs
index size = 1.7G

Total Docs/sec: old 241.3; new 716.8 [ 197.0% faster]
Docs/MB @ flush: old 47.0; new 322.4 [ 586.7% more]
Avg RAM used (MB) @ flush: old 37.9; new 4.5 [ 88.0% less]

4 MB

old
200000 docs in 840.5 secs
index size = 1.7G

new
200000 docs in 260.8 secs
index size = 1.7G

Total Docs/sec: old 237.9; new 767.0 [ 222.3% faster]
Docs/MB @ flush: old 46.8; new 363.1 [ 675.4% more]
Avg RAM used (MB) @ flush: old 33.9; new 6.5 [ 80.9% less]

8 MB

old
200000 docs in 678.8 secs
index size = 1.7G

new
200000 docs in 248.8 secs
index size = 1.7G

Total Docs/sec: old 294.6; new 803.7 [ 172.8% faster]
Docs/MB @ flush: old 46.8; new 392.4 [ 739.1% more]
Avg RAM used (MB) @ flush: old 60.3; new 10.7 [ 82.2% less]

16 MB

old
200000 docs in 660.6 secs
index size = 1.7G

new
200000 docs in 247.3 secs
index size = 1.7G

Total Docs/sec: old 302.8; new 808.7 [ 167.1% faster]
Docs/MB @ flush: old 46.7; new 415.4 [ 788.8% more]
Avg RAM used (MB) @ flush: old 47.1; new 19.2 [ 59.3% less]

24 MB

old
200000 docs in 658.1 secs
index size = 1.7G

new
200000 docs in 243.0 secs
index size = 1.7G

Total Docs/sec: old 303.9; new 823.0 [ 170.8% faster]
Docs/MB @ flush: old 46.7; new 430.9 [ 822.2% more]
Avg RAM used (MB) @ flush: old 70.0; new 27.5 [ 60.8% less]

32 MB

old
200000 docs in 714.2 secs
index size = 1.7G

new
200000 docs in 239.2 secs
index size = 1.7G

Total Docs/sec: old 280.0; new 836.0 [ 198.5% faster]
Docs/MB @ flush: old 46.7; new 432.2 [ 825.2% more]
Avg RAM used (MB) @ flush: old 92.5; new 36.7 [ 60.3% less]

48 MB

old
200000 docs in 640.3 secs
index size = 1.7G

new
200000 docs in 236.0 secs
index size = 1.7G

Total Docs/sec: old 312.4; new 847.5 [ 171.3% faster]
Docs/MB @ flush: old 46.7; new 438.5 [ 838.8% more]
Avg RAM used (MB) @ flush: old 138.9; new 52.8 [ 62.0% less]

64 MB

old
200000 docs in 649.3 secs
index size = 1.7G

new
200000 docs in 238.3 secs
index size = 1.7G

Total Docs/sec: old 308.0; new 839.3 [ 172.5% faster]
Docs/MB @ flush: old 46.7; new 441.3 [ 844.7% more]
Avg RAM used (MB) @ flush: old 302.6; new 72.7 [ 76.0% less]

80 MB

old
200000 docs in 670.2 secs
index size = 1.7G

new
200000 docs in 227.2 secs
index size = 1.7G

Total Docs/sec: old 298.4; new 880.5 [ 195.0% faster]
Docs/MB @ flush: old 46.7; new 446.2 [ 855.2% more]
Avg RAM used (MB) @ flush: old 231.7; new 94.3 [ 59.3% less]

96 MB

old
200000 docs in 683.4 secs
index size = 1.7G

new
200000 docs in 226.8 secs
index size = 1.7G

Total Docs/sec: old 292.7; new 882.0 [ 201.4% faster]
Docs/MB @ flush: old 46.7; new 448.0 [ 859.1% more]
Avg RAM used (MB) @ flush: old 274.5; new 112.7 [ 59.0% less]

Some observations:

  • Remember the test is already biased against "new" because with the
    patch you get an optimized index in the end but with "old" you
    don't.
  • Sweet spot for old (trunk) seems to be 48 MB: that is the peak
    docs/sec @ 312.4.
  • New (with patch) seems to just get faster the more memory you give
    it, though gradually. The peak was 96 MB (the largest I ran). So
    no sweet spot (or maybe I need to give more memory, but, above 96
    MB the trunk was starting to swap on my test env).
  • New gets better and better RAM efficiency, the more RAM you give.
    This makes sense: it's better able to compress the terms dict, the
    more docs are merged in RAM before having to flush to disk. I
    would also expect this curve to be somewhat content dependent.

Michael McCandless added a comment - 30/Apr/07 10:39 AM
I attached a new iteration of the patch. It's quite different from
the last patch.

After discussion on java-dev last time, I decided to retry the
"persistent hash" approach, where the Postings hash lasts across many
docs and then a single flush produces a partial segment containing all
of those docs. This is in contrast to the previous approach where
each doc makes its own segment and then they are merged.

It turns out this is even faster than my previous approach, especially
for smaller docs and especially when term vectors are off (because no
quicksort() is needed until the segment is flushed). I will attach
new benchmark results.

Other changes:

  • Changed my benchmarking tool / testing (IndexLineFiles):
  • I turned off compound file (to reduce time NOT spent on
    indexing).
  • I noticed I was not downcasing the terms, so I fixed that
  • I now do my own line processing to reduce GC cost of
    "BufferedReader.readLine" (to reduct time NOT spent on
    indexing).
  • Norms now properly flush to disk in the autoCommit=false case
  • All unit tests pass except disk full
  • I turned on asserts for unit tests (jvm arg -ea added to junit ant
    task). I think we should use asserts when running tests. I have
    quite a few asserts now.

With this new approach, as I process each term in the document I
immediately write the prox/freq in their compact (vints) format into
shared byte[] buffers, rather than accumulating int[] arrays that then
need to be re-processed into the vint encoding. This speeds things up
because we don't double-process the postings. It also uses less
per-document RAM overhead because intermediate postings are stored as
vints not as ints.

When enough RAM is used by the Posting entries plus the byte[]
buffers, I flush them to a partial RAM segment. When enough of these
RAM segments have accumulated I flush to a real Lucene segment
(autoCommit=true) or to on-disk partial segments (autoCommit=false)
which are then merged in the end to create a real Lucene segment.


Marvin Humphrey added a comment - 30/Apr/07 11:44 AM
How are you writing the frq data in compressed format? The works fine for
prx data, because the deltas are all within a single doc – but for the freq
data, the deltas are tied up in doc num deltas, so you have to decompress it
when performing merges.

To continue our discussion from java-dev...

  • I haven't been able to come up with a file format tweak that
    gets around this doc-num-delta-decompression problem to enhance the speed
    of frq data merging. I toyed with splitting off the freq from the
    doc_delta, at the price of increasing the file size in the common case of
    freq == 1, but went back to the old design. It's not worth the size
    increase for what's at best a minor indexing speedup.
  • I've added a custom MemoryPool class to KS which grabs memory in 1 meg
    chunks, allows resizing (downwards) of only the last allocation, and can
    only release everything at once. From one of these pools, I'm allocating
    RawPosting objects, each of which is a doc_num, a freq, the term_text, and
    the pre-packed prx data (which varies based on which Posting subclass
    created the RawPosting object). I haven't got things 100% stable yet, but
    preliminary results seem to indicate that this technique, which is a riff
    on your persistent arrays, improves indexing speed by about 15%.

Michael McCandless added a comment - 30/Apr/07 12:14 PM

> How are you writing the frq data in compressed format? The works fine for
> prx data, because the deltas are all within a single doc – but for the freq
> data, the deltas are tied up in doc num deltas, so you have to decompress it
> when performing merges.

For each Posting I keep track of the last docID that its term occurred
in; when this differs from the current docID I record the "delta code"
that needs to be written and then I later write it with the final freq
for this document.

> * I haven't been able to come up with a file format tweak that
> gets around this doc-num-delta-decompression problem to enhance the speed
> of frq data merging. I toyed with splitting off the freq from the
> doc_delta, at the price of increasing the file size in the common case of
> freq == 1, but went back to the old design. It's not worth the size
> increase for what's at best a minor indexing speedup.

I'm just doing the "stitching" approach here: it's only the very first
docCode (& freq when freq==1) that must be re-encoded on merging. The
one catch is you must store the last docID of the previous segment so
you can compute the new delta at the boundary. Then I do a raw
"copyBytes" for the remainder of the freq postings.

Note that I'm only doing this for the "internal" merges (of partial
RAM segments and flushed partial segments) I do before creating a real
Lucene segment. I haven't changed how the "normal" Lucene segment
merging works (though I think we should look into it – I opened a
separate issue): it still re-interprets and then re-encodes all
docID/freq's.

> * I've added a custom MemoryPool class to KS which grabs memory in 1 meg
> chunks, allows resizing (downwards) of only the last allocation, and can
> only release everything at once. From one of these pools, I'm allocating
> RawPosting objects, each of which is a doc_num, a freq, the term_text, and
> the pre-packed prx data (which varies based on which Posting subclass
> created the RawPosting object). I haven't got things 100% stable yet, but
> preliminary results seem to indicate that this technique, which is a riff
> on your persistent arrays, improves indexing speed by about 15%.

Fabulous!!

I think it's the custom memory management I'm doing with slices into
shared byte[] arrays for the postings that made the persistent hash
approach work well, this time around (when I had previously tried this
it was slower).


Michael McCandless added a comment - 30/Apr/07 01:49 PM
Results with the above patch:

RAM = 32 MB
NUM THREADS = 1
MERGE FACTOR = 10

2000000 DOCS @ ~550 bytes plain text

No term vectors nor stored fields

AUTOCOMMIT = true (commit whenever RAM is full)

old
2000000 docs in 782.8 secs
index size = 436M

new
2000000 docs in 93.4 secs
index size = 430M

Total Docs/sec: old 2554.8; new 21421.1 [ 738.5% faster]
Docs/MB @ flush: old 128.0; new 4058.0 [ 3069.6% more]
Avg RAM used (MB) @ flush: old 140.2; new 38.0 [ 72.9% less]

AUTOCOMMIT = false (commit only once at the end)

old
2000000 docs in 780.2 secs
index size = 436M

new
2000000 docs in 90.6 secs
index size = 427M

Total Docs/sec: old 2563.3; new 22086.8 [ 761.7% faster]
Docs/MB @ flush: old 128.0; new 4118.4 [ 3116.7% more]
Avg RAM used (MB) @ flush: old 144.6; new 36.4 [ 74.8% less]

With term vectors (positions + offsets) and 2 small stored fields

AUTOCOMMIT = true (commit whenever RAM is full)

old
2000000 docs in 1227.6 secs
index size = 2.1G

new
2000000 docs in 559.8 secs
index size = 2.1G

Total Docs/sec: old 1629.2; new 3572.5 [ 119.3% faster]
Docs/MB @ flush: old 93.1; new 4058.0 [ 4259.1% more]
Avg RAM used (MB) @ flush: old 193.4; new 38.5 [ 80.1% less]

AUTOCOMMIT = false (commit only once at the end)

old
2000000 docs in 1229.2 secs
index size = 2.1G

new
2000000 docs in 241.0 secs
index size = 2.1G

Total Docs/sec: old 1627.0; new 8300.0 [ 410.1% faster]
Docs/MB @ flush: old 93.1; new 4118.4 [ 4323.9% more]
Avg RAM used (MB) @ flush: old 150.5; new 38.4 [ 74.5% less]

200000 DOCS @ ~5,500 bytes plain text

No term vectors nor stored fields

AUTOCOMMIT = true (commit whenever RAM is full)

old
200000 docs in 352.2 secs
index size = 406M

new
200000 docs in 86.4 secs
index size = 403M

Total Docs/sec: old 567.9; new 2313.7 [ 307.4% faster]
Docs/MB @ flush: old 83.5; new 420.0 [ 402.7% more]
Avg RAM used (MB) @ flush: old 76.8; new 38.1 [ 50.4% less]

AUTOCOMMIT = false (commit only once at the end)

old
200000 docs in 399.2 secs
index size = 406M

new
200000 docs in 89.6 secs
index size = 400M

Total Docs/sec: old 501.0; new 2231.0 [ 345.3% faster]
Docs/MB @ flush: old 83.5; new 422.6 [ 405.8% more]
Avg RAM used (MB) @ flush: old 76.7; new 36.2 [ 52.7% less]

With term vectors (positions + offsets) and 2 small stored fields

AUTOCOMMIT = true (commit whenever RAM is full)

old
200000 docs in 594.2 secs
index size = 1.7G

new
200000 docs in 229.0 secs
index size = 1.7G

Total Docs/sec: old 336.6; new 873.3 [ 159.5% faster]
Docs/MB @ flush: old 47.9; new 420.0 [ 776.9% more]
Avg RAM used (MB) @ flush: old 157.8; new 38.0 [ 75.9% less]

AUTOCOMMIT = false (commit only once at the end)

old
200000 docs in 605.1 secs
index size = 1.7G

new
200000 docs in 181.3 secs
index size = 1.7G

Total Docs/sec: old 330.5; new 1103.1 [ 233.7% faster]
Docs/MB @ flush: old 47.9; new 422.6 [ 782.2% more]
Avg RAM used (MB) @ flush: old 132.0; new 37.1 [ 71.9% less]

20000 DOCS @ ~55,000 bytes plain text

No term vectors nor stored fields

AUTOCOMMIT = true (commit whenever RAM is full)

old
20000 docs in 180.8 secs
index size = 350M

new
20000 docs in 79.1 secs
index size = 349M

Total Docs/sec: old 110.6; new 252.8 [ 128.5% faster]
Docs/MB @ flush: old 25.0; new 46.8 [ 87.0% more]
Avg RAM used (MB) @ flush: old 112.2; new 44.3 [ 60.5% less]

AUTOCOMMIT = false (commit only once at the end)

old
20000 docs in 180.1 secs
index size = 350M

new
20000 docs in 75.9 secs
index size = 347M

Total Docs/sec: old 111.0; new 263.5 [ 137.3% faster]
Docs/MB @ flush: old 25.0; new 47.5 [ 89.7% more]
Avg RAM used (MB) @ flush: old 111.1; new 42.5 [ 61.7% less]

With term vectors (positions + offsets) and 2 small stored fields

AUTOCOMMIT = true (commit whenever RAM is full)

old
20000 docs in 323.1 secs
index size = 1.4G

new
20000 docs in 183.9 secs
index size = 1.4G

Total Docs/sec: old 61.9; new 108.7 [ 75.7% faster]
Docs/MB @ flush: old 10.4; new 46.8 [ 348.3% more]
Avg RAM used (MB) @ flush: old 74.2; new 44.9 [ 39.5% less]

AUTOCOMMIT = false (commit only once at the end)

old
20000 docs in 323.5 secs
index size = 1.4G

new
20000 docs in 135.6 secs
index size = 1.4G

Total Docs/sec: old 61.8; new 147.5 [ 138.5% faster]
Docs/MB @ flush: old 10.4; new 47.5 [ 354.8% more]
Avg RAM used (MB) @ flush: old 74.3; new 42.9 [ 42.2% less]


Yonik Seeley added a comment - 30/Apr/07 02:10 PM
How does this work with pending deletes?
I assume that if autocommit is false, then you need to wait until the end when you get a real lucene segment to delete the pending terms?

Also, how has the merge policy (or index invariants) of lucene segments changed?
If autocommit is off, then you wait until the end to create a big lucene segment. This new segment may be much larger than segments to it's "left". I suppose the idea of merging rightmost segments should just be dropped in favor of merging the smallest adjacent segments? Sorry if this has already been covered... as I said, I'm trying to follow along at a high level.


Michael McCandless added a comment - 30/Apr/07 06:54 PM
> How does this work with pending deletes?
> I assume that if autocommit is false, then you need to wait until the end when you get a real lucene segment to delete the pending terms?

Yes, all of this sits "below" the pending deletes layer since this
change writes a single segment either when RAM is full
(autoCommit=true) or when writer is closed (autoCommit=false). Then
the deletes get applied like normal (I haven't changed that part).

> Also, how has the merge policy (or index invariants) of lucene segments changed?
> If autocommit is off, then you wait until the end to create a big lucene segment. This new segment may be much larger than segments to it's "left". I suppose the idea of merging rightmost segments should just be dropped in favor of merging the smallest adjacent segments? Sorry if this has already been covered... as I said, I'm trying to follow along at a high level.

Has not been covered, and as usual these are excellent questions
Yonik!

I haven't yet changed anything about merge policy, but you're right
the current invariants won't hold anymore. In fact they already don't
hold if you "flush by RAM" now (APIs are exposed in 2.1 to let you do
this). So we need to do something.

I like your idea to relax merge policy (& invariants) to allow
"merging of any adjacent segments" (not just rightmost ones) and then
make the policy merge the smallest ones / most similarly sized ones,
measuring size by net # bytes in the segment. This would preserve the
"docID monotonicity invariance".

If we take that approach then it would automatically resolve
LUCENE-845 as well (which would otherwise block this issue).


Michael McCandless added a comment - 21/May/07 06:14 PM
Attached latest patch.

I'm now working towards simplify & cleaning up the code & design:
eliminated dead code leftover from the previous iterations, use
existing RAMFile instead of my own new class, refactored
duplicate/confusing code, added comments, etc. It's getting closer to
a committable state but still has a ways to go.

I also renamed the class from MultiDocumentWriter to DocumentsWriter.

To summarize the current design:

1. Write stored fields & term vectors to files in the Directory
immediately (don't buffer these in RAM).

2. Write freq & prox postings to RAM directly as a byte stream
instead of first pass as int[] and then second pass as a byte
stream. This single-pass instead of double-pass is a big
savings. I use slices into shared byte[] arrays to efficiently
allocate bytes to the postings the need them.

3. Build Postings hash that holds the Postings for many documents at
once instead of a single doc, keyed by unique term. Not tearing
down & rebuilding the Postings hash w/ every doc saves alot of
time. Also when term vectors are off this saves quicksort for
every doc and this gives very good performance gain.

When the Postings hash is full (used up the allowed RAM usage) I
then create a real Lucene segment when autoCommit=true, else a
"partial segment".

4. Use my own "partial segment" format that differs from Lucene's
normal segments in that it is optimized for merging (and unusable
for searching). This format, and the merger I created to work
with this format, performs merging mostly by copying blocks of
bytes instead of reinterpreting every vInt in each Postings list.
These partial segments are are only created when IndexWriter has
autoCommit=false, and then on commit they are merged into the
real Lucene segment format.

5. Reuse the Posting, PostingVector, char[] and byte[] objects that
are used by the Postings hash.

I plan to keep simplifying the design & implementation. Specifically,
I'm going to test removing #4 above entirely (using my own "partial
segment" format that's optimized for merging not searching).

While doing this may give back some of the performance gains, that
code is the source of much added complexity in the patch, and, it
duplicates the current SegmentMerger code. It was more necessary
before (when we would merge thousands of single-doc segments in
memory) but now that each segment contains many docs I think we are no
longer gaining as much performance from it.

I plan instead to write all segments in the "real" Lucene segment
format and use the current SegmentMerger, possibly w/ some small
changes, to do the merges even when autoCommit=false. Since we have
another issue (LUCENE-856) to optimize segment merging I can carry
over any optimizations that we may want to keep into that issue. If
this doesn't lose much performance it will make the approach here even
simpler.


Michael McCandless added a comment - 08/Jun/07 01:31 PM
Latest working patch attached.

I've cutover to using Lucene's normal segment merging for all merging
(ie, I no longer use a different merge-efficient format for segments
when autoCommit=false); this has substantially simplified the code.

All unit tests pass except disk-full test and certain contrib tests
(gdata-server, lucli, similarity, wordnet) that I think I'm not
causing.

Other changes:

  • Consolidated flushing of a new segment back into IndexWriter
    (previously DocumentsWriter would do its own flushing when
    autoCommit=false).

I would also like to consolidate merging entirely into
IndexWriter; right now DocumentsWriter does its own merging of the
flushed segments when autoCommit=false (this is because those
segments are "partial" meaning they do not have their own stored
fields or term vectors). I'm trying to find a clean way to do
this...

  • Thread concurrency now works: each thread writes into a separate
    Postings hash (up until a limit (currently 5) at which point the
    threads share the Postings hashes) and then when flushing the
    segment I merge the docIDs together. I flush when the total RAM
    used across threads is over the limit. I ran a test comparing
    thread concurrency on current trunk vs this patch, which I'll post
    next.
  • Reduced bytes used per-unique-term to be lower than current
    Lucene. This means the worst-case document (many terms, all of
    which are unique) should use less RAM overall than Lucene trunk
    does.
  • Added some new unit test cases; added missing "writer.close()" to
    one of the contrib tests.
  • Cleanup, comments, etc. I think the code is getting more
    "approachable" now.

Michael McCandless added a comment - 08/Jun/07 01:33 PM
I ran a benchmark using more than 1 thread to do indexing, in order to
test & compare concurrency of trunk and the patch. The test is the
same as above, and runs on a 4 core Mac Pro (OS X) box with 4 drive
RAID 0 IO system.

Here are the raw results:

DOCS = ~5,500 bytes plain text
RAM = 32 MB
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)

NUM THREADS = 1

new
200000 docs in 172.3 secs
index size = 1.7G

old
200000 docs in 539.5 secs
index size = 1.7G

Total Docs/sec: old 370.7; new 1161.0 [ 213.2% faster]
Docs/MB @ flush: old 47.9; new 334.6 [ 598.7% more]
Avg RAM used (MB) @ flush: old 131.9; new 33.1 [ 74.9% less]

NUM THREADS = 2

new
200001 docs in 130.8 secs
index size = 1.7G

old
200001 docs in 452.8 secs
index size = 1.7G

Total Docs/sec: old 441.7; new 1529.3 [ 246.2% faster]
Docs/MB @ flush: old 47.9; new 301.5 [ 529.7% more]
Avg RAM used (MB) @ flush: old 226.1; new 35.2 [ 84.4% less]

NUM THREADS = 3

new
200002 docs in 105.4 secs
index size = 1.7G

old
200002 docs in 428.4 secs
index size = 1.7G

Total Docs/sec: old 466.8; new 1897.9 [ 306.6% faster]
Docs/MB @ flush: old 47.9; new 277.8 [ 480.2% more]
Avg RAM used (MB) @ flush: old 289.8; new 37.0 [ 87.2% less]

NUM THREADS = 4

new
200003 docs in 104.8 secs
index size = 1.7G

old
200003 docs in 440.4 secs
index size = 1.7G

Total Docs/sec: old 454.1; new 1908.5 [ 320.3% faster]
Docs/MB @ flush: old 47.9; new 259.9 [ 442.9% more]
Avg RAM used (MB) @ flush: old 293.7; new 37.1 [ 87.3% less]

NUM THREADS = 5

new
200004 docs in 99.5 secs
index size = 1.7G

old
200004 docs in 425.0 secs
index size = 1.7G

Total Docs/sec: old 470.6; new 2010.5 [ 327.2% faster]
Docs/MB @ flush: old 47.9; new 245.3 [ 412.6% more]
Avg RAM used (MB) @ flush: old 390.9; new 38.3 [ 90.2% less]

NUM THREADS = 6

new
200005 docs in 106.3 secs
index size = 1.7G

old
200005 docs in 427.1 secs
index size = 1.7G

Total Docs/sec: old 468.2; new 1882.3 [ 302.0% faster]
Docs/MB @ flush: old 47.8; new 248.5 [ 419.3% more]
Avg RAM used (MB) @ flush: old 340.9; new 38.7 [ 88.6% less]

NUM THREADS = 7

new
200006 docs in 106.1 secs
index size = 1.7G

old
200006 docs in 435.2 secs
index size = 1.7G

Total Docs/sec: old 459.6; new 1885.3 [ 310.2% faster]
Docs/MB @ flush: old 47.8; new 248.7 [ 420.0% more]
Avg RAM used (MB) @ flush: old 408.6; new 39.1 [ 90.4% less]

NUM THREADS = 8

new
200007 docs in 109.0 secs
index size = 1.7G

old
200007 docs in 469.2 secs
index size = 1.7G

Total Docs/sec: old 426.3; new 1835.2 [ 330.5% faster]
Docs/MB @ flush: old 47.8; new 251.3 [ 425.5% more]
Avg RAM used (MB) @ flush: old 448.9; new 39.0 [ 91.3% less]

Some quick comments:

  • Both trunk & the patch show speedups if you use more than 1 thread
    to do indexing. This is expected since the machine has concurrency.
  • The biggest speedup is from 1->2 threads but still good gains from
    2->5 threads.
  • Best seems to be 5 threads.
  • The patch allows better concurrency: relatively speaking it speeds
    up faster than the trunk (the % faster increases as we add
    threads) as you increase # threads. I think this makes sense
    because we flush less often with the patch, and, flushing is time
    consuming and single threaded.

Michael McCandless added a comment - 15/Jun/07 07:00 PM
Attached latest patch.

I think this patch is ready to commit. I will let it sit for a while
so people can review it.

We still need to do LUCENE-845 before it can be committed as is.

However one option instead would be to commit this patch, but leave
IndexWriter flushing by doc count by default and then later switch it
to flush by net RAM usage once LUCENE-845 is done. I like this option
best.

All tests pass (I've re-enabled the disk full tests and fixed error
handling so they now pass) on Windows XP, Debian Linux and OS X.

Summary of the changes in this rev:

  • Finished cleaning up & commenting code
  • Exception handling: if there is a disk full or any other exception
    while adding a document or flushing then the index is rolled back
    to the last commit point.
  • Added more unit tests
  • Removed my profiling tool from the patch (not intended to be
    committed)
  • Fixed a thread safety issue where if you flush by doc count you
    would sometimes get more than the doc count at flush than you
    requested. I moved the thread synchronization for determining
    flush time down into DocumentsWriter.
  • Also fixed thread safety of calling flush with one thread while
    other threads are still adding documents.
  • The biggest change is: absorbed all merging logic back into
    IndexWriter.

Previously in DocumentsWriter I was tracking my own
flushed/partial segments and merging them on my own (but using
SegmentMerger). This makes DocumentsWriter much simpler: now its
sole purpose is to gather added docs and write a new segment.

This turns out to be a big win:

  • Code is much simpler (no duplication of "merging"
    policy/logic)
  • 21-25% additional performance gain for autoCommit=false case
    when stored fields & vectors are used
  • IndexWriter.close() no longer takes an unexpected long time to
    close in autoCommit=false case

However I had to make a change to the index format to do this.
The basic idea is to allow multiple segments to share access to
the "doc store" (stored fields, vectors) index files.

The change is quite simple: FieldsReader/VectorsReader are now
told the doc offset that they should start from when seeking in
the index stream (this info is stored in SegmentInfo). When
merging segments we don't merge the "doc store" files when all
segments are sharing the same ones (big performance gain), else,
we make a private copy of the "doc store" files (ie as segments
normally are on the trunk today).

The change is fully backwards compatible (I added a test case to
the backwards compatibility unit test to be sure) and the change
is only used when autoCommit=false.

When autoCommit=false, the writer will append stored fields /
vectors to a single set of files even though it is flushing normal
segments whenever RAM is full. These normal segments all refer to
the single shared set of "doc store" files. Then when segments
are merged, the newly merged segment has its own "private" doc
stores again. So the sharing only occurs for the "level 0"
segments.

I still need to update fileformats doc with this change.


Yonik Seeley added a comment - 15/Jun/07 09:26 PM
> When merging segments we don't merge the "doc store" files when all segments are sharing the same ones (big performance gain),

Is this only in the case where the segments have no deleted docs?


Michael McCandless added a comment - 16/Jun/07 01:00 AM
> > When merging segments we don't merge the "doc store" files when all segments are sharing the same ones (big performance gain),
>
> Is this only in the case where the segments have no deleted docs?

Right. Also the segments must be contiguous which the current merge
policy ensures but future merge policies may not.


Michael McCandless added a comment - 18/Jun/07 01:56 PM
OK, I attached a new version (take9) of the patch that reverts back to
the default of "flush after every 10 documents added" in IndexWriter.
This removes the dependency on LUCENE-845.

However, I still think we should later (once LUCENE-845 is done)
default IndexWriter to flush by RAM usage since this will generally
give the best "out of the box" performance. I will open a separate
issue to change the default after this issue is resolved.


Steven Parkes added a comment - 20/Jun/07 03:44 PM
I've started looking at this, what it would take to merge with the merge policy stuff (LUCENE-847). Noticed that there are a couple of test failures?

Michael McCandless added a comment - 20/Jun/07 03:58 PM
Oh, were the test failures only in the TestBackwardsCompatibility?

Because I changed the index file format, I added 2 more ZIP files to
that unit test, but, "svn diff" doesn't pick up the new zip files. So
I'm attaching them. Can you pull off these zip files into your
src/test/org/apache/lucene/index and test again? Thanks.


Steven Parkes added a comment - 20/Jun/07 05:37 PM
Yeah, that was it.

I'll be delving more into the code as I try to figure out how it will dove tail with the merge policy factoring.


Michael McCandless added a comment - 20/Jun/07 11:25 PM
> Yeah, that was it.

Phew!

> I'll be delving more into the code as I try to figure out how it will
> dove tail with the merge policy factoring.

OK, thanks. I am very eager to get some other eyeballs looking for
issues with this patch!

I think this patch and the merge policy refactoring should be fairly
separate.

With this patch, "flushing" RAM -> Lucene segment is no longer a
"mergeSegments" call which I think simplifies IndexWriter. Previously
mergeSegments had lots of extra logic to tell if it was merging RAM
segments (= a flush) vs merging "real" segments but now it's simpler
because mergeSegments really only merges segments.


Michael Busch added a comment - 21/Jun/07 03:05 AM
Hi Mike,

my first comment on this patch is: Impressive!

It's also quite overwhelming at the beginning, but I'm trying to dig into it. I'll probably have more questions, here's the first one:

Does DocumentsWriter also solve the problem DocumentWriter had before LUCENE-880? I believe the answer is yes. Even though you close the TokenStreams in the finally clause of invertField() like DocumentWriter did before 880 this is safe, because addPosition() serializes the term strings and payload bytes into the posting hash table right away. Is that right?


Michael Busch added a comment - 21/Jun/07 06:51 AM
Mike,

the benchmarks you run focus on measuring the pure indexing performance. I think it would be interesting to know how big the speedup is in real-life scenarios, i. e. with StandardAnalyzer and maybe even HTML parsing? For sure the speedup will be less, but it should still be a significant improvement. Did you run those kinds of benchmarks already?


Michael McCandless added a comment - 21/Jun/07 09:35 AM
> Does DocumentsWriter also solve the problem DocumentWriter had
> before LUCENE-880? I believe the answer is yes. Even though you
> close the TokenStreams in the finally clause of invertField() like
> DocumentWriter did before 880 this is safe, because addPosition()
> serializes the term strings and payload bytes into the posting hash
> table right away. Is that right?

That's right. When I merged in the fix for LUCENE-880, I realized
with this patch it's fine to close the token stream immediately after
processing all of its tokens because everything about the token stream
has been "absorbed" into postings hash.

> the benchmarks you run focus on measuring the pure indexing
> performance. I think it would be interesting to know how big the
> speedup is in real-life scenarios, i. e. with StandardAnalyzer and
> maybe even HTML parsing? For sure the speedup will be less, but it
> should still be a significant improvement. Did you run those kinds
> of benchmarks already?

Good question ... I haven't measured the performance cost of using
StandardAnalyzer or HTML parsing but I will test & post back.


Michael McCandless added a comment - 21/Jun/07 02:00 PM
OK I ran tests comparing analyzer performance.

It's the same test framework as above, using the ~5,500 byte Europarl
docs with autoCommit=true, 32 MB RAM buffer, no stored fields nor
vectors, and CFS=false, indexing 200,000 documents.

The SimpleSpaceAnalyzer is my own whitespace analyzer that minimizes
GC cost by not allocating a Term or String for every token in every
document.

Each run is best time of 2 runs:

ANALYZER PATCH (sec) TRUNK (sec) SPEEDUP
SimpleSpaceAnalyzer 79.0 326.5 4.1 X
StandardAnalyzer 449.0 674.1 1.5 X
WhitespaceAnalyzer 104.0 338.9 3.3 X
SimpleAnalyzer 104.7 328.0 3.1 X

StandardAnalyzer is definiteely rather time consuming!


Michael Busch added a comment - 21/Jun/07 05:08 PM
> OK I ran tests comparing analyzer performance.

Thanks for the numbers Mike. Yes the gain is less with StandardAnalyzer
but 1.5X faster is still very good!

I have some question about the extensibility of your code. For flexible
indexing we want to be able in the future to implement different posting
formats and we might even want to allow our users to implement own
posting formats.

When I implemented multi-level skipping I tried to keep this in mind.
Therefore I put most of the functionality in the two abstract classes
MultiLevelSkipListReader/Writer. Subclasses implement the actual format
of the skip data. I think with this design it should be quite easy to
implement different formats in the future while limiting the code
complexity.

With the old DocumentWriter I think this is quite simple to do too by
adding a class like PostingListWriter, where subclasses define the actual
format (because DocumentWriter is so simple).

Do you think your code is easily extensible in this regard? I'm
wondering because of all the optimizations you're doing like e. g.
sharing byte arrays. But I'm certainly not familiar enough with your code
yet, so I'm only guessing here.


Michael McCandless added a comment - 21/Jun/07 05:59 PM
> Do you think your code is easily extensible in this regard? I'm
> wondering because of all the optimizations you're doing like e. g.
> sharing byte arrays. But I'm certainly not familiar enough with your code
> yet, so I'm only guessing here.

Good question!

DocumentsWriter is definitely more complex than DocumentWriter, but it
doesn't prevent extensibility and I think will work very well when we
do flexible indexing.

The patch now has dedicated methods for writing into the freq/prox/etc
streams ('writeFreqByte', 'writeFreqVInt', 'writeProxByte',
'writeProxVInt', etc.), but, this could easily be changed to instead
use true IndexOutput streams. This would then hide all details of
shared byte arrays from whoever is doing the writing.

The way I roughly see flexible indexing working in the future is
DocumentsWriter will be responsible for keeping track of unique terms
seen (in its hash table), holding the Posting instance (which could be
subclassed in the future) for each term, flushing a real segment when
full, handling shared byte arrays, etc. Ie all the "infrastructure".

But then the specific logic of what bytes are written into which
streams (freq/prox/vectors/others) will be handled by a separate class
or classes that we can plug/unplug according to some "schema".
DocumentsWriter would call on these classes and provide the
IndexOutput's for all streams for the Posting, per position, and these
classes write their own format into the IndexOutputs.

I think a separation like that would work well: we could have good
performance and also extensibility. Devil is in the details of
course...

I obviously haven't factored DocumentsWriter in this way (it has its
own addPosition that writes the current Lucene index format) but I
think this is very doable in the future.


Doron Cohen added a comment - 23/Jun/07 06:59 AM
Mike, I am considering testing the performance of this patch on a somewhat different use case, real one I think. After indexing 25M docs of TREC .gov2 (~500GB of docs) I pushed the index terms to create a spell correction index, by using the contrib spell checker. Docs here are very short - For each index term a document is created, containing some N-GRAMS. On the specific machine I used there are 2 CPUs but the SpellChecker indexing does not take advantage of that. Anyhow, 126,684,685 words==documents were indexed.
For the docs adding step I had:
mergeFactor = 100,000
maxBufferedDocs = 10,000
So no merging took place.
This step took 21 hours, and created 12,685 segments, total size 15 - 20 GB.
Then I optimized the index with
mergeFactor = 400
(Larger values were hard on the open files limits.)

I thought it would be interesting to see how the new code performs in this scenario, what do you think?

If you too find this comparison interesting, I have two more questions:

  • what settings do you recommend?
  • is there any chance for speed-up in optimize()? I didn't read your
    new code yet, but at least from some comments here it seems that
    on disk merging was not changed... is this (still) so? I would skip the
    optimize part if this is not of interest for the comparison. (In fact I am
    still waiting for my optimize() to complete, but if it is not of interest I
    will just interrupt it...)

Thanks,
Doron


Michael McCandless added a comment - 23/Jun/07 09:43 AM

> I thought it would be interesting to see how the new code performs in this scenario, what do you think?

Yes I'd be very interested to see the results of this. It's a
somewhat "unusual" indexing situation (such tiny docs) but it's a real
world test case. Thanks!

> - what settings do you recommend?

I think these are likely the important ones in this case:

  • Flush by RAM instead of doc count
    (writer.setRAMBufferSizeMB(...)).
  • Give it as much RAM as you can.
  • Use maybe 3 indexing threads (if you can).
  • Turn off compound file.
  • If you have stored fields/vectors (seems not in this case) use
    autoCommit=false.
  • Use a trivial analyzer that doesn't create new String/new Token
    (re-use the same Token, and use the char[] based term text
    storage instead of the String one).
  • Re-use Document/Field instances. The DocumentsWriter is fine with
    this and it saves substantial time from GC especially because your
    docs are so tiny (per-doc overhead is otherwise a killer). In
    IndexLineFiles I made a StringReader that lets me reset its String
    value; this way I didn't have to change the Field instances stored
    in the Document.

> - is there any chance for speed-up in optimize()? I didn't read
> your new code yet, but at least from some comments here it seems
> that on disk merging was not changed... is this (still) so? I would

Correct: my patch doesn't touch merging and optimizing. All it does
now is gather many docs in RAM and then flush a new segment when it's
time. I've opened a separate issue (LUCENE-856) for optimizations
in segment merging.


Doron Cohen added a comment - 24/Jun/07 07:56 PM
Just to clarify your comment on reusing field and doc instances - to my understanding reusing a field instance is ok only after the containing doc was added to the index.

For a "fair" comparison I ended up not following most of your recommendations, including the reuse field/docs one and the non-compound one (apologies), but I might use them later.

For the first 100,000,000 docs (==speller words) the speed-up is quite amazing:
Orig: Speller: added 100000000 words in 10912 seconds = 3 hours 1 minutes 52 seconds
New: Speller: added 100000000 words in 58490 seconds = 16 hours 14 minutes 50 seconds
This is 5.3 times faster !!!

This btw was with maxBufDocs=100,000 (I forgot to set the MEM param).
I stopped the run now, I don't expect to learn anything new by letting it continue.

When trying with MEM=512MB, it at first seemed faster, but then there were now and then local slow-downs, and eventually it became a bit slower than the previous run. I know these are not merges, so they are either flushes (RAM directed), or GC activity. I will perhaps run with GC debug flags and perhaps add a print at flush so to tell the culprit for these local slow-downs.

Other than that, I will perhaps try to index .GOV2 (25 Million HTML docs) with this patch. The way I indexed it before it took about 4 days - running in 4 threads, and creating 36 indexes. This is even more a real life scenario, it involves HTML parsing, standard analysis, and merging (to some extent). Since there are 4 threads each one will get, say, 250MB. Again, for a "fair" comparison, I will remain with compound.


Michael McCandless added a comment - 24/Jun/07 08:56 PM

> Just to clarify your comment on reusing field and doc instances - to my
> understanding reusing a field instance is ok only after the containing
> doc was added to the index.

Right, if your documents are very "regular" you should get a sizable
speedup (especially for tiny docs), with or without this patch, if you
make a single Document and add separate Field instances to it for
each field, and then reuse the Document and Field instances for all
the docs you want to add.

It's not easy to reuse Field instances now (there's no
setStringValue()). I made a ReusableStringReader to do this but you
could also make your own class that implements Fieldable.

> For a "fair" comparison I ended up not following most of your
> recommendations, including the reuse field/docs one and the non-compound
> one (apologies), but I might use them later.

OK, when you say "fair" I think you mean because you already had a
previous run that used compound file, you had to use compound file in
the run with the LUCENE-843 patch (etc)? The recommendations above
should speed up Lucene with or without my patch.

> For the first 100,000,000 docs (==speller words) the speed-up is quite
> amazing:
> Orig: Speller: added 100000000 words in 10912 seconds = 3 hours 1
> minutes 52 seconds
> New: Speller: added 100000000 words in 58490 seconds = 16 hours 14
> minutes 50 seconds
> This is 5.3 times faster !!!

Wow! I think the speedup might be even more if both of your runs followed
the suggestions above.

> This btw was with maxBufDocs=100,000 (I forgot to set the MEM param).
> I stopped the run now, I don't expect to learn anything new by letting it
> continue.
>
> When trying with MEM=512MB, it at first seemed faster, but then there
> were now and then local slow-downs, and eventually it became a bit slower
> than the previous run. I know these are not merges, so they are either
> flushes (RAM directed), or GC activity. I will perhaps run with GC debug
> flags and perhaps add a print at flush so to tell the culprit for these
> local slow-downs.

Hurm, odd. I haven't pushed RAM buffer up to 512 MB so it could be GC
cost somehow makes things worse ... curious.

> Other than that, I will perhaps try to index .GOV2 (25 Million HTML docs)
> with this patch. The way I indexed it before it took about 4 days -
> running in 4 threads, and creating 36 indexes. This is even more a real
> life scenario, it involves HTML parsing, standard analysis, and merging
> (to some extent). Since there are 4 threads each one will get, say,
> 250MB. Again, for a "fair" comparison, I will remain with compound.

OK, because you're doing StandardAnalyzer and HTML parsing and
presumably loading one-doc-per-file, most of your time is spent
outside of Lucene indexing so I'd expect less that 50% speedup in
this case.


Michael McCandless added a comment - 06/Jul/07 11:52 AM
Re-opening this issue: I saw one failure of the contrib/benchmark
TestPerfTasksLogic.testParallelDocMaker() tests due to an intermittant
thread-safety issue. It's hard to get the failure to happen (it's
happened only once in ~20 runs of contrib/benchmark) but I see where
the issue is. Will commit a fix shortly.

Steven Parkes added a comment - 12/Jul/07 09:27 PM
Did we lose the triggered merge stuff from 887, i.e.,, should it be

if (triggerMerge) { /* new merge policy if (0 == docWriter.getMaxBufferedDocs()) maybeMergeSegments(mergeFactor * numDocs / 2); else maybeMergeSegments(docWriter.getMaxBufferedDocs()); */ maybeMergeSegments(docWriter.getMaxBufferedDocs()); }


Michael McCandless added a comment - 12/Jul/07 09:59 PM
Woops ... you are right; thanks for catching it! I will add a unit
test & fix it. I will also make the flush(boolean triggerMerge,
boolean flushDocStores) protected, not public, and move the javadoc
back to the public flush().