New rev of the patch:

Another rev of the patch:

BTW, one of the nice side effects of this patch is it cleans up the

Another rev of the patch. All tests pass except disk full tests. The code is still rather "dirty" and not well commented. I think I'm close to finishing optimizing and now I will focus on

Here are the changes in this rev:
Some details on how I measure RAM usage: both the baseline (current lucene trunk) and my patch have two general classes of RAM usage.

The first class, "document processing RAM", is RAM used while

The second class, "indexed documents RAM", is the RAM used up by

So when I say the writer is allowed to use 32 MB of RAM, I'm only

I then define "RAM efficiency" (docs/MB) as how many docs we can hold

I also measure overall RAM used in the JVM (using

To do the benchmarking I created a simple standalone tool (demo/IndexLineFiles, in the last patch) that indexes one line at a time from a large previously created file, optionally using multiple threads. I do it this way to minimize IO cost of pulling the document source because I want to measure just indexing time as much as possible.

Each line is read and a doc is created with field "contents" that is

For the corpus, I took Europarl's "en" content, stripped tags, and

All settings (mergeFactor, compound file, etc.) are left at defaults.

I ran the tests with Java 1.5 on a Mac Pro quad (2 Intel CPUs, each

A couple more details on the testing: I run java -server to get all
optimizations in the JVM, and the IO system is a local OS X RAID 0 of 4 SATA drives.

Using the above tool I ran an initial set of benchmarks comparing old

For each document size I run 4 combinations of whether term vectors

Here are the results for the 10K tokens (= ~55,000 bytes plain text)

20000 DOCS @ ~55,000 bytes plain text

  No term vectors nor stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      Total Docs/sec: old 99.8; new 158.7 [ 59.0% faster]

    AUTOCOMMIT = false (commit only once at the end)
      Total Docs/sec: old 98.7; new 166.7 [ 69.0% faster]

  With term vectors (positions + offsets) and 2 small stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      Total Docs/sec: old 53.4; new 84.7 [ 58.7% faster]

    AUTOCOMMIT = false (commit only once at the end)
      Total Docs/sec: old 51.9; new 109.4 [ 111.0% faster]

Here are the results for "normal" sized docs (1K tokens = ~5,500 bytes plain text each):

200000 DOCS @ ~5,500 bytes plain text

  No term vectors nor stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      Total Docs/sec: old 503.1; new 1194.1 [ 137.3% faster]

    AUTOCOMMIT = false (commit only once at the end)
      Total Docs/sec: old 506.9; new 1187.7 [ 134.3% faster]

  With term vectors (positions + offsets) and 2 small stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      Total Docs/sec: old 265.2; new 656.0 [ 147.4% faster]

    AUTOCOMMIT = false (commit only once at the end)
      Total Docs/sec: old 268.9; new 818.7 [ 204.5% faster]

Last are the results for small docs (100 tokens = ~550 bytes plain text each):

2000000 DOCS @ ~550 bytes plain text

  No term vectors nor stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      Total Docs/sec: old 2255.6; new 8676.4 [ 284.7% faster]

    AUTOCOMMIT = false (commit only once at the end)
      Total Docs/sec: old 2250.5; new 8348.7 [ 271.0% faster]

  With term vectors (positions + offsets) and 2 small stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      Total Docs/sec: old 1351.2; new 4329.3 [ 220.4% faster]

    AUTOCOMMIT = false (commit only once at the end)
      Total Docs/sec: old 1342.8; new 5749.4 [ 328.2% faster]

A few notes from these results:
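For reference, the "% faster" figures in these results appear to be the relative throughput gain, (new / old - 1) * 100. A minimal sketch that reproduces a few of them from the docs/sec numbers follows. The class and method names are made up for illustration, and the published percentages were presumably computed from unrounded timings, so the last digit can differ slightly.

```java
import java.util.Locale;

// Illustrative helper only; not part of the patch or the benchmark tool.
final class Speedup {
    /** Relative gain in percent: how much faster newRate is than oldRate. */
    static double percentFaster(double oldRate, double newRate) {
        return (newRate / oldRate - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // (old docs/sec, new docs/sec) pairs copied from the results above
        double[][] runs = { {99.8, 158.7}, {503.1, 1194.1}, {2255.6, 8676.4} };
        for (double[] r : runs) {
            System.out.printf(Locale.ROOT, "old %.1f; new %.1f [%.1f%% faster]%n",
                    r[0], r[1], percentFaster(r[0], r[1]));
        }
    }
}
```

Note that "100% faster" in this convention means twice the throughput, so the 284.7% figure corresponds to roughly 3.8x the docs/sec of trunk.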
> The actual HEAP RAM usage is quite a bit more
> stable with the patch, especially with term vectors
> & stored fields enabled. I think this is because the
> patch creates far less garbage for GC to periodically
> reclaim. I think this also means you could push your
> RAM buffer size even higher to get better performance.

For KinoSearch, the sweet spot seems to be a buffer of around 16 MB when benchmarking with the Reuters corpus on my G4 laptop. Larger than that and things actually slow down, unless the buffer is large enough that it never needs flushing. My hypothesis is that RAM fragmentation is slowing down malloc/free. I'll be interested as to whether you see the same effect.

>> The actual HEAP RAM usage is quite a bit more

Interesting. OK I will run the benchmark across increasing RAM sizes

OK I ran old (trunk) vs new (this patch) with increasing RAM buffer

I used the "normal" sized docs (~5,500 bytes plain text), left stored

Here're the results:

NUM THREADS = 1

  RAM buffer   old docs/sec   new docs/sec   faster
  1 MB         232.0          673.2          190.2%
  2 MB         241.3          716.8          197.0%
  4 MB         237.9          767.0          222.3%
  8 MB         294.6          803.7          172.8%
  16 MB        302.8          808.7          167.1%
  24 MB        303.9          823.0          170.8%
  32 MB        280.0          836.0          198.5%
  48 MB        312.4          847.5          171.3%
  64 MB        308.0          839.3          172.5%
  80 MB        298.4          880.5          195.0%
  96 MB        292.7          882.0          201.4%

Some observations:
I attached a new iteration of the patch. It's quite different from the last patch.

After discussion on java-dev last time, I decided to retry the

It turns out this is even faster than my previous approach, especially

Other changes:
With this new approach, as I process each term in the document I

When enough RAM is used by the Posting entries plus the byte[]

How are you writing the frq data in compressed format? This works fine for prx data, because the deltas are all within a single doc, but for the freq data, the deltas are tied up in doc num deltas, so you have to decompress it when performing merges.

To continue our discussion from java-dev...
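To make the frq concern concrete: if doc numbers are stored as deltas from the previous doc in the same run, two runs can in principle be concatenated without decoding every entry, provided the last doc number of the first run is known, because only the first delta of the second run has to be rewritten. Below is a minimal sketch of that idea on plain int arrays. The names and layout are illustrative assumptions, not the patch's or KinoSearch's actual format; real postings would use a variable-length byte encoding and interleave the freqs.

```java
import java.util.Arrays;

// Illustration of delta-coded doc numbers; not the on-disk frq format.
final class DeltaStitch {
    /** Encode ascending doc numbers as gaps; the first entry is the doc number itself. */
    static int[] encode(int[] docs) {
        int[] deltas = new int[docs.length];
        int prev = 0;
        for (int i = 0; i < docs.length; i++) {
            deltas[i] = docs[i] - prev;
            prev = docs[i];
        }
        return deltas;
    }

    /** Recover absolute doc numbers by summing the gaps. */
    static int[] decode(int[] deltas) {
        int[] docs = new int[deltas.length];
        int doc = 0;
        for (int i = 0; i < deltas.length; i++) {
            doc += deltas[i];
            docs[i] = doc;
        }
        return docs;
    }

    /**
     * Concatenate two delta runs where every doc in b is greater than lastDocOfA.
     * Only the first entry of b changes: it is rebased from an absolute doc
     * number to a gap relative to the last doc of a.
     */
    static int[] stitch(int[] a, int lastDocOfA, int[] b) {
        int[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        if (b.length > 0) {
            out[a.length] = b[0] - lastDocOfA;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] a = encode(new int[] {3, 7, 20});
        int[] b = encode(new int[] {25, 26, 40});
        int[] merged = stitch(a, 20, b);
        System.out.println(Arrays.toString(decode(merged))); // [3, 7, 20, 25, 26, 40]
    }
}
```

With a byte-oriented encoding the rebased first delta may take a different number of bytes than the original, which is why the bytes after it can be copied verbatim but the boundary entry cannot.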
> How are you writing the frq data in compressed format? This works fine for

For each Posting I keep track of the last docID that its term occurred

> * I haven't been able to come up with a file format tweak that

I'm just doing the "stitching" approach here: it's only the very first

Note that I'm only doing this for the "internal" merges (of partial

> * I've added a custom MemoryPool class to KS which grabs memory in 1 meg

Fabulous!!

I think it's the custom memory management I'm doing with slices into

Results with the above patch:
RAM = 32 MB

2000000 DOCS @ ~550 bytes plain text

  No term vectors nor stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      Total Docs/sec: old 2554.8; new 21421.1 [ 738.5% faster]

    AUTOCOMMIT = false (commit only once at the end)
      Total Docs/sec: old 2563.3; new 22086.8 [ 761.7% faster]

  With term vectors (positions + offsets) and 2 small stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      Total Docs/sec: old 1629.2; new 3572.5 [ 119.3% faster]

    AUTOCOMMIT = false (commit only once at the end)
      Total Docs/sec: old 1627.0; new 8300.0 [ 410.1% faster]

200000 DOCS @ ~5,500 bytes plain text

  No term vectors nor stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      Total Docs/sec: old 567.9; new 2313.7 [ 307.4% faster]

    AUTOCOMMIT = false (commit only once at the end)
      Total Docs/sec: old 501.0; new 2231.0 [ 345.3% faster]

  With term vectors (positions + offsets) and 2 small stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      Total Docs/sec: old 336.6; new 873.3 [ 159.5% faster]

    AUTOCOMMIT = false (commit only once at the end)
      Total Docs/sec: old 330.5; new 1103.1 [ 233.7% faster]

20000 DOCS @ ~55,000 bytes plain text

  No term vectors nor stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      Total Docs/sec: old 110.6; new 252.8 [ 128.5% faster]

    AUTOCOMMIT = false (commit only once at the end)
      Total Docs/sec: old 111.0; new 263.5 [ 137.3% faster]

  With term vectors (positions + offsets) and 2 small stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      Total Docs/sec: old 61.9; new 108.7 [ 75.7% faster]

    AUTOCOMMIT = false (commit only once at the end)
      Total Docs/sec: old 61.8; new 147.5 [ 138.5% faster]

How does this work with pending deletes?
I assume that if autocommit is false, then you need to wait until the end when you get a real lucene segment to delete the pending terms?

Also, how has the merge policy (or index invariants) of lucene segments changed?

> How does this work with pending deletes?
> I assume that if autocommit is false, then you need to wait until the end when you get a real lucene segment to delete the pending terms?

Yes, all of this sits "below" the pending deletes layer since this

> Also, how has the merge policy (or index invariants) of lucene segments changed?

Has not been covered, and as usual these are excellent questions

I haven't yet changed anything about merge policy, but you're right

I like your idea to relax merge policy (& invariants) to allow

If we take that approach then it would automatically resolve

Attached latest patch.
I'm now working towards simplifying & cleaning up the code & design:

I also renamed the class from MultiDocumentWriter to DocumentsWriter.

To summarize the current design:

1. Write stored fields & term vectors to files in the Directory

2. Write freq & prox postings to RAM directly as a byte stream

3. Build Postings hash that holds the Postings for many documents at

When the Postings hash is full (used up the allowed RAM usage) I

4. Use my own "partial segment" format that differs from Lucene's

5. Reuse the Posting, PostingVector, char[] and byte[] objects that

I plan to keep simplifying the design & implementation. Specifically,

While doing this may give back some of the performance gains, that

I plan instead to write all segments in the "real" Lucene segment

Latest working patch attached.
I've cut over to using Lucene's normal segment merging for all merging

All unit tests pass except disk-full test and certain contrib tests

Other changes:
I would also like to consolidate merging entirely into
I ran a benchmark using more than 1 thread to do indexing, in order to test & compare concurrency of trunk and the patch. The test is the same as above, and runs on a 4 core Mac Pro (OS X) box with 4 drive RAID 0 IO system.

Here are the raw results:

DOCS = ~5,500 bytes plain text

  NUM THREADS   old docs/sec   new docs/sec   faster
  1             370.7          1161.0         213.2%
  2             441.7          1529.3         246.2%
  3             466.8          1897.9         306.6%
  4             454.1          1908.5         320.3%
  5             470.6          2010.5         327.2%
  6             468.2          1882.3         302.0%
  7             459.6          1885.3         310.2%
  8             426.3          1835.2         330.5%

Some quick comments:
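A multi-threaded run like this can be structured as several worker threads pulling lines from one shared reader and handing each line to the indexer, with throughput computed from the total doc count and wall-clock time. A rough sketch of such a harness follows. It is not the demo/IndexLineFiles tool: the `indexLine` consumer is a hypothetical stand-in for building a document and adding it to the writer, and the input is an in-memory string so the example runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical harness; the real benchmark tool is not shown in this thread.
final class LineBench {
    /** Runs numThreads workers over the reader; returns how many lines were handed to indexLine. */
    static long run(BufferedReader in, int numThreads, Consumer<String> indexLine) {
        AtomicLong count = new AtomicLong();
        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        String line;
                        synchronized (in) {      // one reader shared by all workers
                            line = in.readLine();
                        }
                        if (line == null) {
                            break;               // end of input
                        }
                        indexLine.accept(line);  // stand-in for "build doc, add to writer"
                        count.incrementAndGet();
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return count.get();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("line ").append(i).append('\n');
        }
        AtomicLong chars = new AtomicLong();
        long start = System.nanoTime();
        long docs = run(new BufferedReader(new StringReader(sb.toString())), 4,
                line -> chars.addAndGet(line.length()));
        double sec = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d docs in %.3f sec = %.1f docs/sec%n", docs, sec, docs / sec);
    }
}
```

Reading under a single lock keeps the input side trivial, so any scaling (or lack of it) as threads are added reflects contention inside the indexing call rather than in the harness.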
Attached latest patch.
I think this patch is ready to commit. I will let it sit for a while

We still need to do

However one option instead would be to commit this patch, but leave

All tests pass (I've re-enabled the disk full tests and fixed error

Summary of the changes in this rev:

Previously in DocumentsWriter I was tracking my own

This turns out to be a big win:

However I had to make a change to the index format to do this. The change is quite simple: FieldsReader/VectorsReader are now

The change is fully backwards compatible (I added a test case to

When autoCommit=false, the writer will append stored fields /

I still need to update fileformats doc with this change.

> When merging segments we don't merge the "doc store" files when all segments are sharing the same ones (big performance gain),

Is this only in the case where the segments have no deleted docs?

> > When merging segments we don't merge the "doc store" files when all segments are sharing the same ones (big performance gain),
>
> Is this only in the case where the segments have no deleted docs?

Right. Also the segments must be contiguous which the current merge

OK, I attached a new version (take9) of the patch that reverts back to
the default of "flush after every 10 documents added" in IndexWriter.

This removes the dependency on

However, I still think we should later (once

I've started looking at this, what it would take to merge with the merge policy stuff (

Oh, were the test failures only in the TestBackwardsCompatibility?

Because I changed the index file format, I added 2 more ZIP files to

Yeah, that was it.

I'll be delving more into the code as I try to figure out how it will dovetail with the merge policy factoring.

> Yeah, that was it.

Phew!

> I'll be delving more into the code as I try to figure out how it will

OK, thanks. I am very eager to get some other eyeballs looking for

I think this patch and the merge policy refactoring should be fairly

With this patch, "flushing" RAM -> Lucene segment is no longer a

Hi Mike,
my first comment on this patch is: Impressive! It's also quite overwhelming at the beginning, but I'm trying to dig into it. I'll probably have more questions, here's the first one:

Does DocumentsWriter also solve the problem DocumentWriter had before

Mike,

the benchmarks you run focus on measuring the pure indexing performance. I think it would be interesting to know how big the speedup is in real-life scenarios, i.e. with StandardAnalyzer and maybe even HTML parsing? For sure the speedup will be less, but it should still be a significant improvement. Did you run those kinds of benchmarks already?

> Does DocumentsWriter also solve the problem DocumentWriter had
> before
> close the TokenStreams in the finally clause of invertField() like
> DocumentWriter did before 880 this is safe, because addPosition()
> serializes the term strings and payload bytes into the posting hash
> table right away. Is that right?

That's right. When I merged in the fix for

> the benchmarks you run focus on measuring the pure indexing

Good question ... I haven't measured the performance cost of using

OK I ran tests comparing analyzer performance.
It's the same test framework as above, using the ~5,500 byte Europarl

The SimpleSpaceAnalyzer is my own whitespace analyzer that minimizes

Each run is best time of 2 runs:

ANALYZER PATCH (sec) TRUNK (sec) SPEEDUP

StandardAnalyzer is definitely rather time consuming!

> OK I ran tests comparing analyzer performance.
Thanks for the numbers Mike. Yes the gain is less with StandardAnalyzer

I have some questions about the extensibility of your code. For flexible

When I implemented multi-level skipping I tried to keep this in mind.

With the old DocumentWriter I think this is quite simple to do too by

Do you think your code is easily extensible in this regard? I'm

> Do you think your code is easily extensible in this regard? I'm
> wondering because of all the optimizations you're doing like e. g.
> sharing byte arrays. But I'm certainly not familiar enough with your code
> yet, so I'm only guessing here.

Good question!

DocumentsWriter is definitely more complex than DocumentWriter, but it

The patch now has dedicated methods for writing into the freq/prox/etc

The way I roughly see flexible indexing working in the future is

But then the specific logic of what bytes are written into which

I think a separation like that would work well: we could have good

I obviously haven't factored DocumentsWriter in this way (it has its

Mike, I am considering testing the performance of this patch on a somewhat different use case, a real one I think. After indexing 25M docs of TREC .gov2 (~500GB of docs) I pushed the index terms to create a spell correction index, by using the contrib spell checker. Docs here are very short: for each index term a document is created, containing some N-GRAMS. On the specific machine I used there are 2 CPUs but the SpellChecker indexing does not take advantage of that. Anyhow, 126,684,685 words==documents were indexed.
For the docs adding step I had:

  mergeFactor = 100,000
  maxBufferedDocs = 10,000

So no merging took place. This step took 21 hours, and created 12,685 segments, total size 15 - 20 GB.

Then I optimized the index with

  mergeFactor = 400

(Larger values were hard on the open files limits.)

I thought it would be interesting to see how the new code performs in this scenario, what do you think?

If you too find this comparison interesting, I have two more questions:

Thanks,

> I thought it would be interesting to see how the new code performs in this scenario, what do you think?

Yes I'd be very interested to see the results of this. It's a

> - what settings do you recommend?

I think these are likely the important ones in this case:

> - is there any chance for speed-up in optimize()? I didn't read

Correct: my patch doesn't touch merging and optimizing. All it does

Just to clarify your comment on reusing field and doc instances - to my understanding reusing a field instance is ok only after the containing doc was added to the index.
For a "fair" comparison I ended up not following most of your recommendations, including the reuse field/docs one and the non-compound one (apologies

For the first 100,000,000 docs (==speller words) the speed-up is quite amazing:

This btw was with maxBufDocs=100,000 (I forgot to set the MEM param). When trying with MEM=512MB, it at first seemed faster, but then there were now and then local slow-downs, and eventually it became a bit slower than the previous run. I know these are not merges, so they are either flushes (RAM directed), or GC activity. I will perhaps run with GC debug flags and perhaps add a print at flush so to tell the culprit for these local slow-downs.

Other than that, I will perhaps try to index .GOV2 (25 Million HTML docs) with this patch. The way I indexed it before it took about 4 days - running in 4 threads, and creating 36 indexes. This is even more a real life scenario, it involves HTML parsing, standard analysis, and merging (to some extent). Since there are 4 threads each one will get, say, 250MB. Again, for a "fair" comparison, I will remain with compound.

> Just to clarify your comment on reusing field and doc instances - to my

Right, if your documents are very "regular" you should get a sizable

It's not easy to reuse Field instances now (there's no

> For a "fair" comparison I ended up not following most of your

OK, when you say "fair" I think you mean because you already had a

> For the first 100,000,000 docs (==speller words) the speed-up is quite

Wow! I think the speedup might be even more if both of your runs followed

> This btw was with maxBufDocs=100,000 (I forgot to set the MEM param).

Hurm, odd. I haven't pushed RAM buffer up to 512 MB so it could be GC

> Other than that, I will perhaps try to index .GOV2 (25 Million HTML docs)

OK, because you're doing StandardAnalyzer and HTML parsing and

Re-opening this issue: I saw one failure of the contrib/benchmark
TestPerfTasksLogic.testParallelDocMaker() tests due to an intermittent thread-safety issue. It's hard to get the failure to happen (it's happened only once in ~20 runs of contrib/benchmark) but I see where the issue is. Will commit a fix shortly.

Did we lose the triggered merge stuff from 887, i.e., should it be
    if (triggerMerge) {
      /* new merge policy
      if (0 == docWriter.getMaxBufferedDocs())
        maybeMergeSegments(mergeFactor * numDocs / 2);
      else
        maybeMergeSegments(docWriter.getMaxBufferedDocs());
      */
      maybeMergeSegments(docWriter.getMaxBufferedDocs());
    }

Woops ... you are right; thanks for catching it! I will add a unit test & fix it. I will also make the flush(boolean triggerMerge, boolean flushDocStores) protected, not public, and move the javadoc back to the public flush().
and very much a work in progress and nowhere near ready to commit! I
wanted to get it out there sooner rather than later to get feedback,
maybe entice some daring early adopters, iterate, etc.
It passes all unit tests except the disk-full tests.
There are some big issues yet to resolve:
even before my patch). Not sure how to fix yet.
writer).