Issue Details (XML | Word | Printable)

Key: LUCENE-510
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Michael McCandless
Reporter: Doug Cutting
Votes: 5
Watchers: 3
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

IndexOutput.writeString() should write length in bytes

Created: 03/Mar/06 03:30 AM   Updated: 11/Oct/08 12:49 PM
Return to search
Component/s: Store
Affects Version/s: 2.1
Fix Version/s: 2.4

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works LUCENE-510.patch 2008-03-04 08:55 PM Michael McCandless 51 kB
Text File Licensed for inclusion in ASF works LUCENE-510.take2.patch 2008-03-17 08:04 PM Michael McCandless 120 kB
Java Source File Licensed for inclusion in ASF works SortExternal.java 2006-06-05 08:54 AM Marvin Humphrey 15 kB
File Licensed for inclusion in ASF works strings.diff 2006-05-09 04:04 AM Marvin Humphrey 27 kB
Java Source File Licensed for inclusion in ASF works TestSortExternal.java 2006-06-05 08:54 AM Marvin Humphrey 6 kB
Issue Links:
Reference

Resolution Date: 09/May/08 12:05 PM


 Description  « Hide
We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters. This issue has been discussed at:

http://www.mail-archive.com/java-dev@lucene.apache.org/msg01970.html

We must increment the file format number to indicate this change. At least the format number in the segments file should change.

I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Doug Cutting added a comment - 03/Mar/06 03:32 AM
Link to LUCENE-510.

Marvin Humphrey added a comment - 09/May/06 04:02 AM
The following patch...
  • Changes Lucene to use bytecounts as the prefix to all written Strings
  • Changes Lucene to write standard UTF-8 rather than Modified UTF-8
  • Adds the new test classes MockIndexOutput and TestIndexOutput
  • Increases the number of tests in TestIndexInput

It also slows Lucene down – indexing takes around a 20% speed hit. It would be possible to submit a patch which had a smaller impact on performance, but this one is already over 700 lines long, and it's goal is to achieve standard UTF-8 compliance and modify the definition of Lucene strings as simply and reliably as possible. Optimization patches can now be submitted which build upon this one.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Marvin Humphrey added a comment - 05/Jun/06 08:54 AM
Greets,

I've ported KinoSearch's external sorting module to java, along with its tests. This class is the linchpin for the KinoSearch merge model, as it allows serialized postings to be dumped into a sort pool of effectively unlimited size.

At some point, I'll submit patches implementing the KinoSearch merge model in Lucene. I'm reasonably confident that it will more than make up for the index-time performance hit caused by using bytecounts as string headers.

Thematically, this class belongs in org.apache.lucene.util, and that's where I've put it for now. The classes that will use it are in org.apache.lucene.index, so if it stays in util, it will have to be public. However, it shouldn't be part of Lucene's documented public API. The process by which Lucene's docs are generated is not clear to me, so access control advice would be appreciated.

There are a number of other areas where this code could stand review, especially considering my relatively limited experience using Java. I'd single out exception handling and thread safety, but of course anything else is fair game.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Grant Ingersoll added a comment - 04/Jan/07 02:05 AM
Hi Marvin,

This no longer applies cleanly to trunk, care to update?

Thanks,
Grant


Chuck Williams added a comment - 04/Jan/07 03:15 AM
Has an improvement been made to eliminate the reported 20% indexing hit? That would be a big price to pay.

To me the performance benefits in algorithms that scan for selected fields (e.g., FieldsReader.doc() with a FieldSelector) are much more important than standard UTF-8 compliance.

A 20% hit seems suprising. The pre-scan over the string to be written shouldn't cost much compared to the cost of tokenizing and indeixng that string (assuming it is in an indexed field).

In case it is relevant, I had a related issue in my bulk updater, a case where a vint required at the beginning of a record by the lucene index format was not known until after the end. I solved this with a fixed length vint record that was estimated up front and revised if necessary after the whole record was processed. The vint representation still works if more bytes than necessary are written.


Yonik Seeley added a comment - 04/Jan/07 05:46 AM
I'd like to see everything kept as bytes for as long as possible (right up into Term).
A nice bonus would be to reduce the size of things like the FieldCache, and to allow true binary data.

Marvin Humphrey added a comment - 04/Jan/07 07:01 PM
Grant... At the moment I am completely consumed by the task of getting a devel release of KinoSearch version 0.20 out the door. Once that is taken care of, I will be glad to update this patch, and to explore how to compensate for the performance hit it causes.

Chuck... If bytecount-based strings are adopted, standard UTF-8 probably comes along for the ride. There's actually a 1-2% performance gain to be had using standard over modified because of simplified conditionals. What holds us back is backwards compatibility – but we'll have wrecked backwards compat with the bytecounts. However, I no longer have a strong objection to using Modified UTF-8 (for Lucene, that is – Modified UTF-8 would be a deal-breaker for Lucy), so if somewhere along the way we find a compelling reason to stick with modified UTF-8, so be it.

If bytecount-based strings get adopted, it will be because they hold up on their own merits. They're required for KinoSearch merge model; once KS 0.20 is out, I'll port the new benchmarking stuff, we can study the numbers, and assess whether the significant effort needed to pry that algo into Lucene would be worthwhile.

Yonik... yes, I agree. Even better for indexing time, leave postings in serialized form for the entire indexing session.


Grant Ingersoll added a comment - 01/Jun/07 05:07 PM
I don't have time at the moment

Michael Busch added a comment - 11/Jan/08 09:07 AM
I think it makes total sense to change this. And this issue seems to be
very popular with 5 votes.

Mike, you've done so much performance & indexing work recently. Are
you interested in taking this?


Michael McCandless added a comment - 11/Jan/08 10:03 AM
Yup, I'll take this!

Michael McCandless added a comment - 04/Mar/08 08:55 PM
Attached patch.

I modernized Marvin's original patch and added full backwards
compatibility to it so that old indices can be opened for reading or
writing. New segments are written in the new format.

All tests pass. I think it's close, but, I need to run performance
tests now to measure the impact to indexing throughput.

I think future optimizations can keep the byte[] further, eg, into
Term and FieldCache, as Yonik mentioned. We could also fix
DocumentsWriter to use byte[] for its terms storage which would
improve RAM efficiency for single-byte (ascii) content.

I also updated the TestBackwardsCompatibility testcase to properly
test non-ascii terms.


Grant Ingersoll added a comment - 07/Mar/08 02:09 PM
So, with this we should be able to skip ahead in the FieldsReader, right? I will try to update your patch with that. Should improve lazy loading, etc.

Michael McCandless added a comment - 07/Mar/08 04:10 PM
Yes, exactly. But I think the current patch is already doing this? – ie, using seek instead of skipChars, if the fdt is new.

Grant Ingersoll added a comment - 07/Mar/08 06:23 PM
Cool. I just downloaded and applied, but hadn't looked at it yet.

Michael McCandless added a comment - 17/Mar/08 08:04 PM
New rev of the patch. I think it's ready to commit. I'll wait a few
days.

I made some performance improvements by factoring out a new
UnicodeUtil class that does not allocate new objects for every
conversion to/from UTF8.

One new issue I fixed is the handling of invalid UTF-16 strings.
Specifically if the UTF16 text has invalid surrogate pairs, UTF-8 is
unable to represent it (unlike the current modified UTF-8 Lucene
format). I changed DocumentsWriter & UnicodeUtil to substitute the
replacement char U+FFFD for such invalid surrogate characters. This
affects terms, stored String fields and term vectors.

Indexing performance has a small slowdown (3.5%); details are below.

Unfortunately, time to enumerate terms was more affected. I made a
simple test that enumerates all terms from the index (= ~3.3 million
terms) created below:

public class TestTermEnum {
public static void main(String[] args) throws Exception { IndexReader r = IndexReader.open(args[0]); TermEnum terms = r.terms(); int count = 0; long t0 = System.currentTimeMillis(); while(terms.next()) count++; long t1 = System.currentTimeMillis(); System.out.println(count + " terms in " + (t1-t0) + " millis"); r.close(); }
}

On trunk with current index format this takes 3104 msec (best of 5).
With the patch with UTF8 index format it takes 3443 msec = 10.9%
slower. I don't see any further ways to make this faster.

Details on the indexing performance test:

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer

doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker

docs.file=/Volumes/External/lucene/wiki.txt
doc.stored = true
doc.term.vector = true
doc.add.log.step=2000

directory=FSDirectory
autocommit=false
compound=false

ram.flush.mb=64

{ "Rounds"
ResetSystemErase
{ "BuildIndex"
CreateIndex

{ "AddDocs" AddDoc > : 200000 - CloseIndex }

NewRound
} : 5

RepSumByPrefRound BuildIndex

I ran it on a quad-core Intel Mac Pro, with 4 drive RAID 0 array,
running OS 10.4.11, java 1.5, run with these command-line args:

-server -Xbatch -Xms1024m -Xmx1024m

Best of 5 with current trunk is 921.2 docs/sec and with patch it's
888.7 = 3.5% slowdown.


Hiroaki Kawai added a comment - 18/Mar/08 02:28 AM
I'm wondering why the patch doesn't utilize java.nio.charset.CharsetEncoder, CharsetDecoder....?

Michael McCandless added a comment - 19/Mar/08 10:39 PM

I'm wondering why the patch doesn't utilize java.nio.charset.CharsetEncoder, CharsetDecoder....?

I think there are two reasons for rolling our own instead of using
CharsetEncoder/Decoder. First is performance. If I use
CharsetEncoder, like this:

CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
CharBuffer cb = CharBuffer.allocate(5000);
ByteBuffer bb = ByteBuffer.allocate(5000);
byte[] bbArray = bb.array();
UnicodeUtil.UTF8Result utf8Result = new UnicodeUtil.UTF8Result();

t0 = System.currentTimeMillis();
for(int i=0;i<count;i++) { cb.clear(); cb.put(strings[i]); cb.flip(); bb.clear(); encoder.reset(); encoder.encode(cb, bb, true); }

Then it takes 676 msec to convert ~3.3 million strings from the terms
from indexing first 200K Wikipedia docs. If I replace for loop with:

UnicodeUtil.UTF16toUTF8(strings[i], 0, strings[i].length(), utf8Result);

It's 441 msec.

Second reason is some API mismatch. EG we need to convert char[] that
end in the 0xffff character. Also, we need to do incremental
conversion (only convert changed bytes), which is used by TermEnum.
CharsetEncoder/Decoder doesn't do this.


Hiroaki Kawai added a comment - 20/Mar/08 12:43 PM
To the first reason of performance issue,

CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
CharBuffer cb = CharBuffer.allocate(5000);
ByteBuffer bb = ByteBuffer.allocate(5000);
byte[] bbArray = bb.array();
UnicodeUtil.UTF8Result utf8Result = new UnicodeUtil.UTF8Result();

t0 = System.currentTimeMillis();
for(int i=0;i<count;i++)

Unknown macro: { cb.clear(); cb.put(strings[i]); cb.flip(); bb.clear(); encoder.reset(); encoder.encode(cb, bb, true); }

How about this code?

CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
CharBuffer cb;
ByteBuffer bb = ByteBuffer.allocate(5000);
byte[] bbArray = bb.array();
UnicodeUtil.UTF8Result utf8Result = new UnicodeUtil.UTF8Result();

t0 = System.currentTimeMillis();
for(int i=0;i<count;i++) {
bb.clear();
encoder.reset();
encoder.encode(CharBuffer.wrap(strings[i]), bb, true);
}


Michael McCandless added a comment - 20/Mar/08 01:49 PM
That version takes 1067 msec (best of 3).

Hiroaki Kawai added a comment - 20/Mar/08 03:41 PM
Ah interesting !

I'm not tried yet, but...

CharBuffer cb = ByteBuffer.allocateDirect(5000).asCharBuffer();
ByteBuffer bb = ByteBuffer.allocateDirect(5000);

will improve the performance.


Hiroaki Kawai added a comment - 20/Mar/08 04:11 PM

Second reason is some API mismatch. EG we need to convert char[] that
end in the 0xffff character. Also, we need to do incremental
conversion (only convert changed bytes), which is used by TermEnum.
CharsetEncoder/Decoder doesn't do this.

IMHO, the char 0xffff is not telling that the string is the end. The marker is not to be used to dermine the termination, but is for assertion, I think (in DocumentsWriterThreadState.java). And we should rewrite DocumentsWriterFieldMergeState.java to provide "private int postingUpto" outside via public method, to use it in DocumentsWriter.java and DocumentsWriterField.java.


Michael McCandless added a comment - 20/Mar/08 04:18 PM
Using allocateDirect is surprisingly slower: 1050 msec.

Michael McCandless added a comment - 20/Mar/08 04:25 PM

IMHO, the char 0xffff is not telling that the string is the end. The marker is not to be used to dermine the termination, but is for assertion, I think (in DocumentsWriterThreadState.java).

The 0xffff is used for more than assertion. It marks the end of the text for each term.

And we should rewrite DocumentsWriterFieldMergeState.java to provide "private int postingUpto" outside via public method, to use it in DocumentsWriter.java and DocumentsWriterField.java.

The "private int postingUpto" in DocumentsWriterFieldMergeState is very different from the usage of 0xffff marker; I'm not sure what you're suggesting here?


Hiroaki Kawai added a comment - 20/Mar/08 04:58 PM
I'm sorry I mistook something in DocumentsWriterFieldMergeState.java.

But, my suggestion here, is use 0xffff only for assertion.

To achive this, we should add textLength property to DocumentsWriterFieldMergeState and use it in DocumentsWriter.java, modify DocumentsWriter#compareText.

I'll submit a patch.


Michael McCandless added a comment - 20/Mar/08 05:21 PM
OK. Can you open a separate issue for that?

Hiroaki Kawai added a comment - 20/Mar/08 05:34 PM
OK. I'll open another ticket.

I'll try measuring the performance, too. I still believe that we should use java.nio.charset for our code maintainance.


Michael McCandless added a comment - 20/Mar/08 05:40 PM
I think using java.nio.charset is too much of a performance hit. I think it will especially further slow down enumerating terms since we can't convert the characters incrementally if we use java.nio.charset.

Hiroaki Kawai added a comment - 21/Mar/08 02:11 AM
I think performance issue is yet another issue, and I believe we can find the way to speed up.

Mark Miller added a comment - 08/May/08 11:43 PM
I suspect that this issue is not reading old indexes perfectly. I am very suspicious that this patch caused what I thought was index corruption the other day. I didn't notice it was a format change and should have paid more attention.

I seem to be able to replicate the issue so hopefully I will have more to report soon.

If my thoughts are unlikely, please let me know.


Michael McCandless added a comment - 09/May/08 09:59 AM
Whoa, I think you are correct Mark!

On inspecting my changes here, I think the bulk-merging of stored fields is to blame. Specifically, when we bulk merge the stored fields we fail to check whether the segments being merged are the pre-UTF8 format. And so that code bulk-copies stored fields in the older format into a file that claims it's using the newer format.

This only affects trunk, not 2.3.

Thanks for being such a brave early-adopter trunk tester, Mark. And, sorry


Mark Miller added a comment - 09/May/08 11:15 AM
No problem Michael. Brave/Foolish, I certainly am one. Can't keep myself away from these fresh issues My fault for not understanding the release process really though. I thought I was roughly getting 2.3.2 ahead of time. Live and learn I got a little lax inspecting all the issues for someone grabbing from trunk...once all my tests pass I usually feel pretty good about it. Time to add some upgrade index tests

Thanks so much for checking into this so quick - certainly saves me some time today.


Michael McCandless added a comment - 09/May/08 11:28 AM

Time to add some upgrade index tests

In fact Lucene already has TestBackwardsCompatibility ... so my first step is to add a test case there that shows this bug...


Michael McCandless added a comment - 09/May/08 12:04 PM
OK, I've fixed TestBackwardsCompatibility to catch this, and then fixed merging of stored fields to work properly. Will commit shortly...

Thanks again Mark!