LUCENE-510

IndexOutput.writeString() should write length in bytes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: 2.4
    • Component/s: core/store
    • Labels:
      None

      Description

      We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters. This issue has been discussed at:

      http://www.mail-archive.com/java-dev@lucene.apache.org/msg01970.html

      We must increment the file format number to indicate this change. At least the format number in the segments file should change.

      I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).
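
      For illustration, a minimal sketch of the proposed on-disk convention (not the committed implementation, which uses a reusable UnicodeUtil conversion to avoid per-call allocations; the helper class below is hypothetical):

      import java.io.IOException;
      import org.apache.lucene.store.IndexInput;
      import org.apache.lucene.store.IndexOutput;

      // Hypothetical helper; the real change is inside IndexOutput.writeString()/IndexInput.readString().
      final class StringFormatSketch {
        // Proposed format: a VInt byte length, followed by standard UTF-8 bytes.
        static void writeString(IndexOutput out, String s) throws IOException {
          byte[] utf8 = s.getBytes("UTF-8");
          out.writeVInt(utf8.length);          // length in bytes, not Java chars
          out.writeBytes(utf8, utf8.length);
        }

        // Matching read side: read the byte length, then decode standard UTF-8.
        static String readString(IndexInput in) throws IOException {
          int numBytes = in.readVInt();
          byte[] utf8 = new byte[numBytes];
          in.readBytes(utf8, 0, numBytes);
          return new String(utf8, "UTF-8");
        }
      }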

      Attachments

      1. LUCENE-510.patch
        51 kB
        Michael McCandless
      2. LUCENE-510.take2.patch
        120 kB
        Michael McCandless
      3. SortExternal.java
        15 kB
        Marvin Humphrey
      4. strings.diff
        27 kB
        Marvin Humphrey
      5. TestSortExternal.java
        6 kB
        Marvin Humphrey

        Issue Links

          Activity

          cutting Doug Cutting added a comment -

          Link to LUCENE-510.

          creamyg Marvin Humphrey added a comment -

          The following patch...

          • Changes Lucene to use bytecounts as the prefix to all written Strings
          • Changes Lucene to write standard UTF-8 rather than Modified UTF-8
          • Adds the new test classes MockIndexOutput and TestIndexOutput
          • Increases the number of tests in TestIndexInput

          It also slows Lucene down – indexing takes around a 20% speed hit. It would be possible to submit a patch with a smaller impact on performance, but this one is already over 700 lines long, and its goal is to achieve standard UTF-8 compliance and modify the definition of Lucene strings as simply and reliably as possible. Optimization patches that build upon this one can now be submitted.

          Marvin Humphrey
          Rectangular Research
          http://www.rectangular.com/

          creamyg Marvin Humphrey added a comment -

          Greets,

          I've ported KinoSearch's external sorting module to Java, along with its tests. This class is the linchpin for the KinoSearch merge model, as it allows serialized postings to be dumped into a sort pool of effectively unlimited size.
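
          For readers unfamiliar with the technique, here is a generic external-sort sketch (buffer items in RAM, flush sorted runs to temp files, then merge the runs); it is not the attached SortExternal.java, and the class and its API are purely illustrative:

          import java.io.BufferedReader;
          import java.io.IOException;
          import java.io.Writer;
          import java.nio.file.Files;
          import java.nio.file.Path;
          import java.util.ArrayList;
          import java.util.Collections;
          import java.util.List;

          // Generic external-sort sketch, not the attached SortExternal.java.
          final class ExternalSortSketch {
            private final List<String> buffer = new ArrayList<String>();
            private final List<Path> runs = new ArrayList<Path>();
            private final int maxBuffered;

            ExternalSortSketch(int maxBuffered) { this.maxBuffered = maxBuffered; }

            // Feed an item into the sort pool; flush a sorted run when the buffer fills.
            void feed(String item) throws IOException {
              buffer.add(item);
              if (buffer.size() >= maxBuffered) flushRun();
            }

            private void flushRun() throws IOException {
              Collections.sort(buffer);
              Path run = Files.createTempFile("sortrun", ".tmp");
              Files.write(run, buffer);
              runs.add(run);
              buffer.clear();
            }

            // Merge the runs: repeatedly emit the smallest head among the open readers.
            void fetchAll(Writer out) throws IOException {
              if (!buffer.isEmpty()) flushRun();
              List<BufferedReader> readers = new ArrayList<BufferedReader>();
              List<String> heads = new ArrayList<String>();
              for (Path run : runs) {
                BufferedReader r = Files.newBufferedReader(run);
                readers.add(r);
                heads.add(r.readLine());
              }
              while (true) {
                int min = -1;
                for (int i = 0; i < heads.size(); i++) {
                  if (heads.get(i) != null && (min == -1 || heads.get(i).compareTo(heads.get(min)) < 0)) {
                    min = i;
                  }
                }
                if (min == -1) break;
                out.write(heads.get(min));
                out.write('\n');
                heads.set(min, readers.get(min).readLine());
              }
              for (BufferedReader r : readers) {
                r.close();
              }
            }
          }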

          At some point, I'll submit patches implementing the KinoSearch merge model in Lucene. I'm reasonably confident that it will more than make up for the index-time performance hit caused by using bytecounts as string headers.

          Thematically, this class belongs in org.apache.lucene.util, and that's where I've put it for now. The classes that will use it are in org.apache.lucene.index, so if it stays in util, it will have to be public. However, it shouldn't be part of Lucene's documented public API. The process by which Lucene's docs are generated is not clear to me, so access control advice would be appreciated.

          There are a number of other areas where this code could stand review, especially considering my relatively limited experience using Java. I'd single out exception handling and thread safety, but of course anything else is fair game.

          Marvin Humphrey
          Rectangular Research
          http://www.rectangular.com/

          gsingers Grant Ingersoll added a comment -

          Hi Marvin,

          This no longer applies cleanly to trunk, care to update?

          Thanks,
          Grant

          manawiz Chuck Williams added a comment -

          Has an improvement been made to eliminate the reported 20% indexing hit? That would be a big price to pay.

          To me the performance benefits in algorithms that scan for selected fields (e.g., FieldsReader.doc() with a FieldSelector) are much more important than standard UTF-8 compliance.

          A 20% hit seems surprising. The pre-scan over the string to be written shouldn't cost much compared to the cost of tokenizing and indexing that string (assuming it is in an indexed field).

          In case it is relevant, I had a related issue in my bulk updater: a case where a vint required at the beginning of a record by the Lucene index format was not known until after the end. I solved this with a fixed-length vint record that was estimated up front and revised if necessary after the whole record was processed. The vint representation still works if more bytes than necessary are written.
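
          The fixed-length vint trick can be sketched as follows (a hypothetical helper, not the bulk-updater code): pad the leading bytes with the continuation bit set so the slot width is fixed, and a standard readVInt() still decodes the value correctly.

          import java.io.IOException;
          import org.apache.lucene.store.IndexOutput;

          // Hypothetical illustration of a fixed-width vint slot that can be patched in place later.
          final class FixedVIntSketch {
            // Write 'value' using exactly 'width' bytes; 'width' must be large enough for the value.
            static void writeFixedVInt(IndexOutput out, int value, int width) throws IOException {
              for (int i = 0; i < width - 1; i++) {
                out.writeByte((byte) ((value & 0x7F) | 0x80)); // low 7 bits, continuation bit set
                value >>>= 7;
              }
              out.writeByte((byte) (value & 0x7F));            // final byte, continuation bit clear
            }
          }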

          yseeley@gmail.com Yonik Seeley added a comment -

          I'd like to see everything kept as bytes for as long as possible (right up into Term).
          A nice bonus would be to reduce the size of things like the FieldCache, and to allow true binary data.

          creamyg Marvin Humphrey added a comment -

          Grant... At the moment I am completely consumed by the task of getting a devel release of KinoSearch version 0.20 out the door. Once that is taken care of, I will be glad to update this patch, and to explore how to compensate for the performance hit it causes.

          Chuck... If bytecount-based strings are adopted, standard UTF-8 probably comes along for the ride. There's actually a 1-2% performance gain to be had using standard over modified because of simplified conditionals. What holds us back is backwards compatibility – but we'll have wrecked backwards compat with the bytecounts. However, I no longer have a strong objection to using Modified UTF-8 (for Lucene, that is – Modified UTF-8 would be a deal-breaker for Lucy), so if somewhere along the way we find a compelling reason to stick with modified UTF-8, so be it.

          If bytecount-based strings get adopted, it will be because they hold up on their own merits. They're required for the KinoSearch merge model; once KS 0.20 is out, I'll port the new benchmarking stuff, we can study the numbers, and assess whether the significant effort needed to pry that algorithm into Lucene would be worthwhile.

          Yonik... yes, I agree. Even better for indexing time, leave postings in serialized form for the entire indexing session.

          gsingers Grant Ingersoll added a comment -

          I don't have time at the moment

          michaelbusch Michael Busch added a comment -

          I think it makes total sense to change this. And this issue seems to be
          very popular with 5 votes.

          Mike, you've done so much performance & indexing work recently. Are
          you interested in taking this?

          mikemccand Michael McCandless added a comment -

          Yup, I'll take this!

          mikemccand Michael McCandless added a comment -

          Attached patch.

          I modernized Marvin's original patch and added full backwards
          compatibility to it so that old indices can be opened for reading or
          writing. New segments are written in the new format.

          All tests pass. I think it's close, but, I need to run performance
          tests now to measure the impact to indexing throughput.

          I think future optimizations can keep the byte[] further, e.g., into
          Term and FieldCache, as Yonik mentioned. We could also fix
          DocumentsWriter to use byte[] for its terms storage, which would
          improve RAM efficiency for single-byte (ASCII) content.

          I also updated the TestBackwardsCompatibility testcase to properly
          test non-ascii terms.

          gsingers Grant Ingersoll added a comment -

          So, with this we should be able to skip ahead in the FieldsReader, right? I will try to update your patch with that. Should improve lazy loading, etc.

          mikemccand Michael McCandless added a comment -

          Yes, exactly. But I think the current patch is already doing this? – i.e., using seek instead of skipChars, if the fdt is new.
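
          A rough sketch of why the byte-length prefix helps here (hypothetical field-skipping code, not the actual FieldsReader): with a char count the reader has to walk the encoded chars, while with a byte count it can seek straight past the field.

          import java.io.IOException;
          import org.apache.lucene.store.IndexInput;

          // Hypothetical sketch of skipping an unwanted stored field.
          final class SkipFieldSketch {
            // Old format: the prefix is a char count, so skipping means walking
            // the variable-width encoded chars.
            static void skipOldFormat(IndexInput in) throws IOException {
              int numChars = in.readVInt();
              in.skipChars(numChars);
            }

            // New format: the prefix is a byte count, so one seek jumps over the field.
            static void skipNewFormat(IndexInput in) throws IOException {
              int numBytes = in.readVInt();
              in.seek(in.getFilePointer() + numBytes);
            }
          }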

          gsingers Grant Ingersoll added a comment -

          Cool. I just downloaded and applied, but hadn't looked at it yet.

          mikemccand Michael McCandless added a comment -

          New rev of the patch. I think it's ready to commit. I'll wait a few
          days.

          I made some performance improvements by factoring out a new
          UnicodeUtil class that does not allocate new objects for every
          conversion to/from UTF8.

          One new issue I fixed is the handling of invalid UTF-16 strings.
          Specifically if the UTF16 text has invalid surrogate pairs, UTF-8 is
          unable to represent it (unlike the current modified UTF-8 Lucene
          format). I changed DocumentsWriter & UnicodeUtil to substitute the
          replacement char U+FFFD for such invalid surrogate characters. This
          affects terms, stored String fields and term vectors.
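
          A minimal sketch of that substitution (not the actual DocumentsWriter/UnicodeUtil code; the helper below is hypothetical): unpaired surrogates in the UTF-16 input are replaced with U+FFFD so the text can be encoded as valid standard UTF-8.

          // Hypothetical illustration of replacing invalid surrogate pairs with U+FFFD.
          final class SurrogateCleanSketch {
            static final char REPLACEMENT = '\uFFFD';

            static String replaceInvalidSurrogates(String s) {
              char[] out = s.toCharArray();
              for (int i = 0; i < out.length; i++) {
                char c = out[i];
                if (Character.isHighSurrogate(c)) {
                  if (i + 1 < out.length && Character.isLowSurrogate(out[i + 1])) {
                    i++;                      // valid pair, keep both chars
                  } else {
                    out[i] = REPLACEMENT;     // unpaired high surrogate
                  }
                } else if (Character.isLowSurrogate(c)) {
                  out[i] = REPLACEMENT;       // low surrogate without a preceding high surrogate
                }
              }
              return new String(out);
            }
          }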

          Indexing performance has a small slowdown (3.5%); details are below.

          Unfortunately, time to enumerate terms was more affected. I made a
          simple test that enumerates all terms (~3.3 million) from the index
          created by the benchmark below:

          public class TestTermEnum {
            public static void main(String[] args) throws Exception {
              IndexReader r = IndexReader.open(args[0]);
              TermEnum terms = r.terms();
              int count = 0;
              long t0 = System.currentTimeMillis();
              while(terms.next())
                count++;
              long t1 = System.currentTimeMillis();
              System.out.println(count + " terms in " + (t1-t0) + " millis");
              r.close();
            }
          }

          On trunk with current index format this takes 3104 msec (best of 5).
          With the patch with UTF8 index format it takes 3443 msec = 10.9%
          slower. I don't see any further ways to make this faster.

          Details on the indexing performance test:

          analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer

          doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker

          docs.file=/Volumes/External/lucene/wiki.txt
          doc.stored = true
          doc.term.vector = true
          doc.add.log.step=2000

          directory=FSDirectory
          autocommit=false
          compound=false

          ram.flush.mb=64

          { "Rounds"
          ResetSystemErase
          { "BuildIndex"
          CreateIndex

          { "AddDocs" AddDoc > : 200000 - CloseIndex }

          NewRound
          } : 5

          RepSumByPrefRound BuildIndex

          I ran it on a quad-core Intel Mac Pro, with 4 drive RAID 0 array,
          running OS 10.4.11, java 1.5, run with these command-line args:

          -server -Xbatch -Xms1024m -Xmx1024m

          Best of 5 with current trunk is 921.2 docs/sec and with patch it's
          888.7 = 3.5% slowdown.

          kawai Hiroaki Kawai added a comment -

          I'm wondering why the patch doesn't utilize java.nio.charset.CharsetEncoder, CharsetDecoder....?

          mikemccand Michael McCandless added a comment -

          I'm wondering why the patch doesn't utilize java.nio.charset.CharsetEncoder, CharsetDecoder....?

          I think there are two reasons for rolling our own instead of using
          CharsetEncoder/Decoder. First is performance. If I use
          CharsetEncoder, like this:

          CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
          CharBuffer cb = CharBuffer.allocate(5000);
          ByteBuffer bb = ByteBuffer.allocate(5000);
          byte[] bbArray = bb.array();
          UnicodeUtil.UTF8Result utf8Result = new UnicodeUtil.UTF8Result();

          t0 = System.currentTimeMillis();
          for(int i=0;i<count;i++) {
            cb.clear();
            cb.put(strings[i]);
            cb.flip();
            bb.clear();
            encoder.reset();
            encoder.encode(cb, bb, true);
          }

          Then it takes 676 msec to convert ~3.3 million strings, taken from the terms
          produced by indexing the first 200K Wikipedia docs. If I replace the for loop with:

          UnicodeUtil.UTF16toUTF8(strings[i], 0, strings[i].length(), utf8Result);

          It's 441 msec.

          The second reason is an API mismatch. E.g., we need to convert char[] that
          ends in the 0xffff character. Also, we need to do incremental
          conversion (only convert changed bytes), which is used by TermEnum.
          CharsetEncoder/Decoder doesn't do this.

          kawai Hiroaki Kawai added a comment -

          Regarding the first reason, the performance issue:

          CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
          CharBuffer cb = CharBuffer.allocate(5000);
          ByteBuffer bb = ByteBuffer.allocate(5000);
          byte[] bbArray = bb.array();
          UnicodeUtil.UTF8Result utf8Result = new UnicodeUtil.UTF8Result();

          t0 = System.currentTimeMillis();
          for(int i=0;i<count;i++) {
            cb.clear();
            cb.put(strings[i]);
            cb.flip();
            bb.clear();
            encoder.reset();
            encoder.encode(cb, bb, true);
          }

          How about this code?

          CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
          CharBuffer cb;
          ByteBuffer bb = ByteBuffer.allocate(5000);
          byte[] bbArray = bb.array();
          UnicodeUtil.UTF8Result utf8Result = new UnicodeUtil.UTF8Result();

          t0 = System.currentTimeMillis();
          for(int i=0;i<count;i++) {
            bb.clear();
            encoder.reset();
            encoder.encode(CharBuffer.wrap(strings[i]), bb, true);
          }

          mikemccand Michael McCandless added a comment -

          That version takes 1067 msec (best of 3).

          kawai Hiroaki Kawai added a comment -

          Ah, interesting!

          I haven't tried it yet, but...

          CharBuffer cb = ByteBuffer.allocateDirect(5000).asCharBuffer();
          ByteBuffer bb = ByteBuffer.allocateDirect(5000);

          will improve the performance.

          kawai Hiroaki Kawai added a comment -

          The second reason is an API mismatch. E.g., we need to convert char[] that
          ends in the 0xffff character. Also, we need to do incremental
          conversion (only convert changed bytes), which is used by TermEnum.
          CharsetEncoder/Decoder doesn't do this.

          IMHO, the char 0xffff does not indicate the end of the string. The marker is not to be used to determine termination, but is for assertion, I think (in DocumentsWriterThreadState.java). And we should rewrite DocumentsWriterFieldMergeState.java to expose its "private int postingUpto" via a public method, so it can be used in DocumentsWriter.java and DocumentsWriterField.java.

          mikemccand Michael McCandless added a comment -

          Using allocateDirect is surprisingly slower: 1050 msec.

          mikemccand Michael McCandless added a comment -

          IMHO, the char 0xffff does not indicate the end of the string. The marker is not to be used to determine termination, but is for assertion, I think (in DocumentsWriterThreadState.java).

          The 0xffff is used for more than assertion. It marks the end of the text for each term.

          And we should rewrite DocumentsWriterFieldMergeState.java to expose its "private int postingUpto" via a public method, so it can be used in DocumentsWriter.java and DocumentsWriterField.java.

          The "private int postingUpto" in DocumentsWriterFieldMergeState is very different from the usage of 0xffff marker; I'm not sure what you're suggesting here?

          kawai Hiroaki Kawai added a comment -

          I'm sorry, I mistook something in DocumentsWriterFieldMergeState.java.

          But my suggestion here is to use 0xffff only for assertion.

          To achieve this, we should add a textLength property to DocumentsWriterFieldMergeState, use it in DocumentsWriter.java, and modify DocumentsWriter#compareText.

          I'll submit a patch.

          mikemccand Michael McCandless added a comment -

          OK. Can you open a separate issue for that?

          kawai Hiroaki Kawai added a comment -

          OK. I'll open another ticket.

          I'll try measuring the performance, too. I still believe that we should use java.nio.charset for the sake of code maintenance.

          mikemccand Michael McCandless added a comment -

          I think using java.nio.charset is too much of a performance hit. I think it will especially further slow down enumerating terms since we can't convert the characters incrementally if we use java.nio.charset.

          kawai Hiroaki Kawai added a comment -

          I think the performance issue is a separate issue, and I believe we can find a way to speed things up.

          markrmiller@gmail.com Mark Miller added a comment -

          I suspect that this issue is not reading old indexes perfectly. I am very suspicious that this patch caused what I thought was index corruption the other day. I didn't notice it was a format change and should have paid more attention.

          I seem to be able to replicate the issue so hopefully I will have more to report soon.

          If my thoughts are unlikely, please let me know.

          mikemccand Michael McCandless added a comment -

          Whoa, I think you are correct Mark!

          On inspecting my changes here, I think the bulk merging of stored fields is to blame. Specifically, when we bulk merge the stored fields we fail to check whether the segments being merged are in the pre-UTF-8 format. And so that code bulk-copies stored fields in the older format into a file that claims it's using the newer format.

          This only affects trunk, not 2.3.
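
          The shape of the fix can be sketched as follows (hypothetical interface and method names, not the actual SegmentMerger/FieldsWriter API): only bulk-copy the raw stored-field bytes when the source segment already uses the current format, otherwise fall back to rewriting each document.

          import java.io.IOException;

          // Hypothetical sketch of the guarded bulk copy; MergeSource and its methods are illustrative.
          final class StoredFieldsMergeSketch {
            interface MergeSource {
              boolean usesCurrentFormat();                                      // pre-UTF-8 segment or not?
              void bulkCopyRawBytesTo(Object dest) throws IOException;          // fast raw copy of the .fdt data
              void copyDocumentsIndividuallyTo(Object dest) throws IOException; // decode and rewrite each doc
            }

            static void mergeStoredFields(MergeSource src, Object dest) throws IOException {
              if (src.usesCurrentFormat()) {
                // Formats match, so a raw byte copy is safe and fast.
                src.bulkCopyRawBytesTo(dest);
              } else {
                // Old-format segment: rewrite documents in the new format instead of
                // copying old-format bytes into a file that claims the new format.
                src.copyDocumentsIndividuallyTo(dest);
              }
            }
          }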

          Thanks for being such a brave early-adopter trunk tester, Mark. And, sorry

          markrmiller@gmail.com Mark Miller added a comment -

          No problem, Michael. Brave/Foolish, I certainly am one; I can't keep myself away from these fresh issues. My fault for not really understanding the release process, though. I thought I was roughly getting 2.3.2 ahead of time. Live and learn. I got a little lax inspecting all the issues for someone grabbing from trunk... once all my tests pass I usually feel pretty good about it. Time to add some upgrade index tests.

          Thanks so much for checking into this so quickly - certainly saves me some time today.

          mikemccand Michael McCandless added a comment -

          Time to add some upgrade index tests

          In fact Lucene already has TestBackwardsCompatibility ... so my first step is to add a test case there that shows this bug...

          mikemccand Michael McCandless added a comment -

          OK, I've fixed TestBackwardsCompatibility to catch this, and then fixed merging of stored fields to work properly. Will commit shortly...

          Thanks again Mark!


            People

            • Assignee:
              mikemccand Michael McCandless
            • Reporter:
              cutting Doug Cutting
            • Votes:
              5
            • Watchers:
              3
