[LUCENE-843] improve how IndexWriter uses RAM to buffer added documents - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.2
Fix Version/s: 2.3
Component/s: core/index
Labels: None
Lucene Fields: New, Patch Available

Description

I'm working on a new class (MultiDocumentWriter) that writes more than
one document directly into a single Lucene segment, more efficiently
than the current approach.

This only affects the creation of an initial segment from added
documents. I haven't changed anything after that, e.g., how segments
are merged.

The basic ideas are:

- Write stored fields and term vectors directly to disk (don't
  use up RAM for these).

- Gather posting lists & term infos in RAM, but periodically do
  in-RAM merges. Once RAM is full, flush buffers to disk (and
  merge them later when it's time to make a real segment).

- Recycle objects/buffers to reduce time/stress in GC.

- Various other optimizations.
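
To make the second idea concrete, here is a minimal sketch of gathering postings in RAM, flushing them as a sorted run once a byte budget is reached, and merging the runs when a segment is built. It is an illustration under assumptions, not the code in the attached patches: the class ToyPostingsBuffer, its methods, and the byte estimates are all hypothetical, and the "flushed" runs are kept in a list that merely stands in for files on disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy illustration only; hypothetical names, not Lucene's MultiDocumentWriter. */
class ToyPostingsBuffer {
    private final long ramBudgetBytes;
    private long ramUsedBytes = 0;
    private int nextDocId = 0;

    // term -> ascending doc IDs, for documents buffered since the last flush
    private TreeMap<String, List<Integer>> inRam = new TreeMap<>();

    // Each flushed run stands in for a partial postings file written to disk.
    private final List<TreeMap<String, List<Integer>>> flushedRuns = new ArrayList<>();

    ToyPostingsBuffer(long ramBudgetBytes) {
        this.ramBudgetBytes = ramBudgetBytes;
    }

    void addDocument(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) continue;
            List<Integer> postings = inRam.get(term);
            if (postings == null) {
                postings = new ArrayList<>();
                inRam.put(term, postings);
                ramUsedBytes += 2L * term.length(); // crude estimate: 2 bytes per char
            }
            // Record each document at most once per term.
            if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                postings.add(docId);
                ramUsedBytes += 4; // crude estimate: 4 bytes per doc ID
            }
        }
        if (ramUsedBytes >= ramBudgetBytes) {
            flush(); // RAM is "full": hand the buffered postings off as a run
        }
    }

    void flush() {
        if (inRam.isEmpty()) return;
        flushedRuns.add(inRam);
        inRam = new TreeMap<>();
        ramUsedBytes = 0;
    }

    /** Merges all runs, in flush order, into one term-sorted postings map. */
    TreeMap<String, List<Integer>> buildSegment() {
        flush();
        TreeMap<String, List<Integer>> segment = new TreeMap<>();
        for (TreeMap<String, List<Integer>> run : flushedRuns) {
            for (Map.Entry<String, List<Integer>> e : run.entrySet()) {
                segment.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                       .addAll(e.getValue());
            }
        }
        return segment;
    }

    int flushedRunCount() {
        return flushedRuns.size();
    }

    public static void main(String[] args) {
        ToyPostingsBuffer buf = new ToyPostingsBuffer(20);
        buf.addDocument("a b");
        buf.addDocument("b c");
        buf.addDocument("a c");
        System.out.println(buf.buildSegment() + " from " + buf.flushedRunCount() + " runs");
    }
}
```

Because runs are appended in flush order and doc IDs only grow, concatenating each term's lists during the merge keeps them sorted without a separate sort step.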

Some of these changes are similar to how KinoSearch builds a segment.
But, I haven't made any changes to Lucene's file format nor added
requirements for a global fields schema.

So far the only externally visible change is a new method
"setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
deprecated), so that it flushes according to RAM usage rather than
after a fixed number of added documents.
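
The difference between the two flush triggers can be illustrated with a small simulation. This is only a sketch: FlushTriggerDemo and its methods are hypothetical names, document sizes are given directly in bytes, and no IndexWriter is involved. It computes the peak number of buffered bytes when flushing after a fixed document count versus when flushing once a byte budget is reached.

```java
/** Hypothetical simulation; does not call any Lucene API. */
class FlushTriggerDemo {

    /** Peak buffered bytes when flushing after every maxBufferedDocs documents. */
    static long peakBytesFlushByDocCount(int[] docSizes, int maxBufferedDocs) {
        long buffered = 0;
        long peak = 0;
        int docs = 0;
        for (int size : docSizes) {
            buffered += size;
            docs++;
            peak = Math.max(peak, buffered);
            if (docs >= maxBufferedDocs) { // trigger ignores how big the docs were
                buffered = 0;
                docs = 0;
            }
        }
        return peak;
    }

    /** Peak buffered bytes when flushing once ramBufferBytes is reached. */
    static long peakBytesFlushByRam(int[] docSizes, long ramBufferBytes) {
        long buffered = 0;
        long peak = 0;
        for (int size : docSizes) {
            buffered += size;
            peak = Math.max(peak, buffered);
            if (buffered >= ramBufferBytes) { // trigger tracks actual usage
                buffered = 0;
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        int[] docSizes = {10, 10, 10, 1000, 1000, 1000};
        System.out.println("by doc count: " + peakBytesFlushByDocCount(docSizes, 3));
        System.out.println("by RAM:       " + peakBytesFlushByRam(docSizes, 1500));
    }
}
```

In this model, the count-based trigger lets peak usage grow with document size, while the RAM-based trigger keeps the peak below the budget plus the size of one document.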

Attachments

index.presharedstores.nocfs.zip, 20/Jun/07 15:58, 5 kB, Michael McCandless
index.presharedstores.cfs.zip, 20/Jun/07 15:58, 2 kB, Michael McCandless
LUCENE-843.take9.patch, 18/Jun/07 13:56, 204 kB, Michael McCandless
LUCENE-843.take8.patch, 15/Jun/07 19:00, 203 kB, Michael McCandless
LUCENE-843.take7.patch, 08/Jun/07 13:31, 189 kB, Michael McCandless
LUCENE-843.take6.patch, 21/May/07 18:14, 210 kB, Michael McCandless
LUCENE-843.take5.patch, 30/Apr/07 10:39, 239 kB, Michael McCandless
LUCENE-843.take4.patch, 02/Apr/07 14:43, 188 kB, Michael McCandless
LUCENE-843.take3.patch, 28/Mar/07 12:49, 156 kB, Michael McCandless
LUCENE-843.take2.patch, 25/Mar/07 14:30, 148 kB, Michael McCandless
LUCENE-843.patch, 22/Mar/07 17:06, 141 kB, Michael McCandless

Issue Links

is blocked by: LUCENE-845 If you "flush by RAM usage" then IndexWriter may over-merge (Closed)

is related to: SOLR-342 Add support for Lucene's new Indexing and merge features (excluding Document/Field/Token reuse) (Resolved)

People

Assignee: Michael McCandless

Reporter: Michael McCandless

Votes: 5

Watchers: 4

Dates

Created: 22/Mar/07 17:05

Updated: 28/Aug/22 11:36

Resolved: 12/Aug/07 15:48