|
The following patch...
It also slows Lucene down – indexing takes around a 20% speed hit. It would be possible to submit a patch which had a smaller impact on performance, but this one is already over 700 lines long, and it's goal is to achieve standard UTF-8 compliance and modify the definition of Lucene strings as simply and reliably as possible. Optimization patches can now be submitted which build upon this one. Marvin Humphrey Greets,
I've ported KinoSearch's external sorting module to java, along with its tests. This class is the linchpin for the KinoSearch merge model, as it allows serialized postings to be dumped into a sort pool of effectively unlimited size. At some point, I'll submit patches implementing the KinoSearch merge model in Lucene. I'm reasonably confident that it will more than make up for the index-time performance hit caused by using bytecounts as string headers. Thematically, this class belongs in org.apache.lucene.util, and that's where I've put it for now. The classes that will use it are in org.apache.lucene.index, so if it stays in util, it will have to be public. However, it shouldn't be part of Lucene's documented public API. The process by which Lucene's docs are generated is not clear to me, so access control advice would be appreciated. There are a number of other areas where this code could stand review, especially considering my relatively limited experience using Java. I'd single out exception handling and thread safety, but of course anything else is fair game. Marvin Humphrey Hi Marvin,
This no longer applies cleanly to trunk, care to update? Thanks, Has an improvement been made to eliminate the reported 20% indexing hit? That would be a big price to pay.
To me the performance benefits in algorithms that scan for selected fields (e.g., FieldsReader.doc() with a FieldSelector) are much more important than standard UTF-8 compliance. A 20% hit seems suprising. The pre-scan over the string to be written shouldn't cost much compared to the cost of tokenizing and indeixng that string (assuming it is in an indexed field). In case it is relevant, I had a related issue in my bulk updater, a case where a vint required at the beginning of a record by the lucene index format was not known until after the end. I solved this with a fixed length vint record that was estimated up front and revised if necessary after the whole record was processed. The vint representation still works if more bytes than necessary are written. I'd like to see everything kept as bytes for as long as possible (right up into Term).
A nice bonus would be to reduce the size of things like the FieldCache, and to allow true binary data. Grant... At the moment I am completely consumed by the task of getting a devel release of KinoSearch version 0.20 out the door. Once that is taken care of, I will be glad to update this patch, and to explore how to compensate for the performance hit it causes.
Chuck... If bytecount-based strings are adopted, standard UTF-8 probably comes along for the ride. There's actually a 1-2% performance gain to be had using standard over modified because of simplified conditionals. What holds us back is backwards compatibility – but we'll have wrecked backwards compat with the bytecounts. However, I no longer have a strong objection to using Modified UTF-8 (for Lucene, that is – Modified UTF-8 would be a deal-breaker for Lucy), so if somewhere along the way we find a compelling reason to stick with modified UTF-8, so be it. If bytecount-based strings get adopted, it will be because they hold up on their own merits. They're required for KinoSearch merge model; once KS 0.20 is out, I'll port the new benchmarking stuff, we can study the numbers, and assess whether the significant effort needed to pry that algo into Lucene would be worthwhile. Yonik... yes, I agree. Even better for indexing time, leave postings in serialized form for the entire indexing session. I don't have time at the moment
I think it makes total sense to change this. And this issue seems to be
very popular with 5 votes. Mike, you've done so much performance & indexing work recently. Are Attached patch.
I modernized Marvin's original patch and added full backwards All tests pass. I think it's close, but, I need to run performance I think future optimizations can keep the byte[] further, eg, into I also updated the TestBackwardsCompatibility testcase to properly So, with this we should be able to skip ahead in the FieldsReader, right? I will try to update your patch with that. Should improve lazy loading, etc.
Yes, exactly. But I think the current patch is already doing this? – ie, using seek instead of skipChars, if the fdt is new.
Cool. I just downloaded and applied, but hadn't looked at it yet.
New rev of the patch. I think it's ready to commit. I'll wait a few
days. I made some performance improvements by factoring out a new One new issue I fixed is the handling of invalid UTF-16 strings. Indexing performance has a small slowdown (3.5%); details are below. Unfortunately, time to enumerate terms was more affected. I made a public class TestTermEnum { On trunk with current index format this takes 3104 msec (best of 5). Details on the indexing performance test: analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker docs.file=/Volumes/External/lucene/wiki.txt directory=FSDirectory ram.flush.mb=64 { "Rounds" NewRound RepSumByPrefRound BuildIndex I ran it on a quad-core Intel Mac Pro, with 4 drive RAID 0 array, -server -Xbatch -Xms1024m -Xmx1024m Best of 5 with current trunk is 921.2 docs/sec and with patch it's I'm wondering why the patch doesn't utilize java.nio.charset.CharsetEncoder, CharsetDecoder....?
I think there are two reasons for rolling our own instead of using CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder(); t0 = System.currentTimeMillis(); Then it takes 676 msec to convert ~3.3 million strings from the terms UnicodeUtil.UTF16toUTF8(strings[i], 0, strings[i].length(), utf8Result); It's 441 msec. Second reason is some API mismatch. EG we need to convert char[] that To the first reason of performance issue,
How about this code? CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder(); t0 = System.currentTimeMillis(); That version takes 1067 msec (best of 3).
Ah interesting !
I'm not tried yet, but... CharBuffer cb = ByteBuffer.allocateDirect(5000).asCharBuffer(); will improve the performance.
IMHO, the char 0xffff is not telling that the string is the end. The marker is not to be used to dermine the termination, but is for assertion, I think (in DocumentsWriterThreadState.java). And we should rewrite DocumentsWriterFieldMergeState.java to provide "private int postingUpto" outside via public method, to use it in DocumentsWriter.java and DocumentsWriterField.java. Using allocateDirect is surprisingly slower: 1050 msec.
The 0xffff is used for more than assertion. It marks the end of the text for each term.
The "private int postingUpto" in DocumentsWriterFieldMergeState is very different from the usage of 0xffff marker; I'm not sure what you're suggesting here? I'm sorry I mistook something in DocumentsWriterFieldMergeState.java.
But, my suggestion here, is use 0xffff only for assertion. To achive this, we should add textLength property to DocumentsWriterFieldMergeState and use it in DocumentsWriter.java, modify DocumentsWriter#compareText. I'll submit a patch. OK. Can you open a separate issue for that?
OK. I'll open another ticket.
I'll try measuring the performance, too. I still believe that we should use java.nio.charset for our code maintainance. I think using java.nio.charset is too much of a performance hit. I think it will especially further slow down enumerating terms since we can't convert the characters incrementally if we use java.nio.charset.
I think performance issue is yet another issue, and I believe we can find the way to speed up.
I suspect that this issue is not reading old indexes perfectly. I am very suspicious that this patch caused what I thought was index corruption the other day. I didn't notice it was a format change and should have paid more attention.
I seem to be able to replicate the issue so hopefully I will have more to report soon. If my thoughts are unlikely, please let me know. Whoa, I think you are correct Mark!
On inspecting my changes here, I think the bulk-merging of stored fields is to blame. Specifically, when we bulk merge the stored fields we fail to check whether the segments being merged are the pre-UTF8 format. And so that code bulk-copies stored fields in the older format into a file that claims it's using the newer format. This only affects trunk, not 2.3. Thanks for being such a brave early-adopter trunk tester, Mark. And, sorry No problem Michael. Brave/Foolish, I certainly am one. Can't keep myself away from these fresh issues
Thanks so much for checking into this so quick - certainly saves me some time today.
In fact Lucene already has TestBackwardsCompatibility ... so my first step is to add a test case there that shows this bug... OK, I've fixed TestBackwardsCompatibility to catch this, and then fixed merging of stored fields to work properly. Will commit shortly...
Thanks again Mark! |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
LUCENE-510.