Details

    • Type: New Feature
    • Status: Reopened
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.4.1
    • Fix Version/s: None
    • Component/s: core/store
    • Labels: None
    • Lucene Fields: New

      Description

      In LUCENE-1793, there is an off-topic suggestion to provide compression of Unicode data. The motivation was a custom encoding in a Russian analyzer. The original supposition was that it provided a more compact index.

      This led to the comment that a different or compressed encoding would be a generally useful feature.

      BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM with an implementation in ICU. If Lucene provides its own implementation, a freely available, royalty-free license would need to be obtained.

      SCSU is another Unicode compression algorithm that could be used.

      An advantage of these methods is that they work on the whole of Unicode. If that is not needed, an encoding such as ISO-8859-1 (or whatever covers the input) could be used.

      Attachments

      1. Benchmark.java (1 kB) - Yonik Seeley
      2. Benchmark.java (4 kB) - Yonik Seeley
      3. Benchmark.java (1 kB) - Robert Muir
      4. LUCENE-1779.patch (21 kB) - Michael McCandless
      5. LUCENE-1799_big.patch (355 kB) - Robert Muir
      6. LUCENE-1799.patch (13 kB) - Robert Muir
      7. LUCENE-1799.patch (9 kB) - Robert Muir
      8. LUCENE-1799.patch (9 kB) - Robert Muir
      9. LUCENE-1799.patch (9 kB) - Robert Muir
      10. LUCENE-1799.patch (9 kB) - Michael McCandless
      11. LUCENE-1799.patch (7 kB) - Michael McCandless
      12. LUCENE-1799.patch (7 kB) - Michael McCandless
      13. LUCENE-1799.patch (9 kB) - Robert Muir
      14. LUCENE-1799.patch (17 kB) - Uwe Schindler
      15. LUCENE-1799.patch (11 kB) - Uwe Schindler
      16. LUCENE-1799.patch (10 kB) - Uwe Schindler
      17. LUCENE-1799.patch (9 kB) - Uwe Schindler
      18. LUCENE-1799.patch (9 kB) - Uwe Schindler
      19. LUCENE-1799.patch (9 kB) - Robert Muir

        Activity

        Earwin Burrfoot added a comment -

        I think right now this can be implemented as a delegating Directory.

        Robert Muir added a comment -

        Earwin, if implemented as a directory, we lose many of the advantages.

        For example, if you are using BOCU-1, let's say with Hindi, then according to the stats here: http://unicode.org/notes/tn6/#Performance

        • you can encode/decode BOCU-1 to/from UTF-16 more than twice as fast as you can UTF-8 to/from UTF-16 (for this language)
        • also, resulting bytes are less than half the size of UTF-8 (for this language), yet sort order is still preserved, so it should work for term dictionary, etc.

        Note: I have never measured bocu performance in practice.

        I took a look at the flex indexing branch, and it appears this might be possible in the future through a codec...

        Earwin Burrfoot added a comment -

        > Earwin, if implemented as a directory, we lose many of the advantages.
        Damn. I believed all strings pass through read/writeString() on IndexInput/Output. Naive. Well, one can patch UnicodeUtil, but the solution is no longer elegant.
        Waiting for flexible indexing, hoping it's gonna be flexible..

        Robert Muir added a comment -

        Waiting for flexible indexing, hoping it's gonna be flexible..

        It looked to me, at a glance, that some things would still be weird: TermVectors aren't "flexible" yet, so they wouldn't be BOCU-1.
        I do not know if, in flexible indexing, it will be possible for a codec to change behavior like this...
        Maybe someone knows whether this is planned eventually?

        Michael McCandless added a comment -

        The flex API will let you completely customize how the terms dict/index is encoded, but not yet term vectors.

        Mark Miller added a comment -

        Pretty simple though, isn't it? Just pull the vector reader/writer from the codec as well?

        Robert Muir added a comment -

        The flex API will let you completely customize how the terms dict/index is encoded, but not yet term vectors.

        Thanks Mike! As far as the encoding itself, BOCU-1 is available in the ICU library, so we do not need to implement it and deal with the conformance/patent stuff
        (to get the royalty-free patent license you must be "fully compliant", and they have already done this).

        If this feature is desired, I think something like a Codec in contrib that encodes the index with BOCU-1 from ICU would be best.

        Robert Muir added a comment -

        By the way, here are even more details on BOCU, including more in-depth size and performance numbers, at least compared to the UTN:
        http://icu-project.org/repos/icu/icuhtml/trunk/design/conversion/bocu1/bocu1.html

        Earwin Burrfoot added a comment -

        as far as the encoding itself, BOCU-1 is available in the ICU library

        ICU's API requires you to use ByteBuffer and CharBuffer for input/output. And even if I missed some nicer method, the encoder/decoder operates internally on said buffers. Thus, a wrap/unwrap for each String is inevitable.

        Robert Muir added a comment -

        ICU's API requires you to use ByteBuffer and CharBuffer for input/output. And even if I missed some nicer method, the encoder/decoder operates internally on said buffers. Thus, a wrap/unwrap for each String is inevitable.

        Earwin, at least in ICU trunk you have the following (public class) in com.ibm.icu.impl.BOCU:

        public static int compress(String source, byte buffer[], int offset)
        public static int getCompressionLength(String source) 
        ...
        

        But I think this class only supports encoding, not decoding (it is only used by the Collation API for so-called BOCSU).
        For decoding, we might have to use the registered charset and ByteBuffer... unless there's another way.
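
        A minimal round-trip sketch of the registered-charset approach mentioned here (an assumption: the icu4j-charsets jar is on the classpath so that "BOCU-1" resolves; this is illustrative, not code from any attached patch):

        import java.nio.ByteBuffer;
        import java.nio.CharBuffer;
        import java.nio.charset.Charset;

        public class Bocu1RoundTrip {
          public static void main(String[] args) {
            // resolves only if the ICU charset provider (icu4j-charsets) is installed
            Charset bocu1 = Charset.forName("BOCU-1");

            String term = "\u0928\u092E\u0938\u094D\u0924\u0947";            // a sample Hindi term
            ByteBuffer encoded = bocu1.encode(CharBuffer.wrap(term));         // wrap + encode
            String decoded = bocu1.decode(encoded.duplicate()).toString();    // unwrap + decode

            System.out.println(encoded.remaining() + " bytes, round-trip ok: " + decoded.equals(term));
          }
        }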

        Robert Muir added a comment -

        Earwin, I do not really like this implementation either.

        So it would of course be better to have something more suitable, similar to UnicodeUtil, plus you could ditch the lib dependency.
        But then I guess we have to deal with this patent thing... I do not really know what is involved with that.

        Earwin Burrfoot added a comment -

        but then i guess we have to deal with this patent thing... i do not really know what is involved with that.

        CPAN holds a BOCU-1 implementation, derived from the "Sample C code", with all the necessary copyrights and the patent mentioned, but there's no word of them formally obtaining a license. I'm not sure if this is okay, or just overlooked.

        DM Smith added a comment -

        The sample code is probably what is on this page:
        http://unicode.org/notes/tn6/#Sample_Code

        From what I gather reading the whole page:
        If we port the sample code and the test case and then demonstrate that all tests pass, then we will be granted a license.

        There's contact info at the bottom of the page for getting the license. Maybe, contact them for clarification?

        As the code is fairly small, I don't think it would be too hard to port. The trick is that the sample code appears to deal in 32-bit arrays and we'd probably want a byte[].

        Robert Muir added a comment -

        Attached is a simple prototype for encoding terms as BOCU-1.

        While I don't think things like wildcards will work due to the nature of BOCU-1, term and phrase queries should work fine; it maintains UTF-8 sort order, so sorting is fine, and range queries should work once we fix TermRangeQuery to use bytes.

        The impl is probably a bit slow (it uses the charset API), as it's just for playing around.

        Note: I didn't check the box because of the patent thing (not sure it even applies since I use the ICU impl here), but either way I don't want to involve myself with that.

        Uwe Schindler added a comment - edited

        For correctness, the code should use: target.offset = buffer.arrayOffset() + buffer.position();
        In most cases position() will be 0, but assuming that is quite often an error. If you use limit() you also have to use position(), else it's inconsistent. arrayOffset() gives the offset corresponding to position 0, and the length should be remaining() (for example, see IdentityEncoder in the payloads contrib).

        And the default factory could be a singleton...

        Robert Muir added a comment -

        Uwe, sure. If we were to implement this I wouldn't use NIO anyway; like I said, I don't plan on committing anything (unless something is figured out about the patent), but it might be useful to someone.

        I tested this on some hindi text:

        encoding   .tii size   .tis size
        UTF-8      60,205      3,740,329
        BOCU-1     28,431      2,168,407
        Uwe Schindler added a comment -

        And one more thing, in the non-array case: buffer.get(target.bytes, target.offset, limit);
        The target's offset should be set to 0 on all write operations to the BytesRef (see UnicodeUtil.UTF16toUTF8WithHash()). Otherwise the grow() before it does not resize correctly!
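
        A compact sketch of the two buffer-handling rules from these comments, with a hypothetical Bytes holder standing in for Lucene's BytesRef (illustrative only, not code from the patch):

        import java.nio.ByteBuffer;

        class ByteBufferToBytes {
          static class Bytes { byte[] bytes = new byte[16]; int offset; int length; }

          static void copy(ByteBuffer buffer, Bytes target) {
            if (buffer.hasArray()) {
              // array-backed buffer: the offset must include arrayOffset() and position(),
              // and the length is remaining(), not limit()
              target.bytes = buffer.array();
              target.offset = buffer.arrayOffset() + buffer.position();
              target.length = buffer.remaining();
            } else {
              // no backing array: copy out, always writing at offset 0 so that a
              // preceding resize (grow() on a real BytesRef) sizes the array correctly
              target.offset = 0;
              target.length = buffer.remaining();
              if (target.bytes.length < target.length) {
                target.bytes = new byte[target.length];   // stand-in for BytesRef.grow()
              }
              buffer.get(target.bytes, 0, target.length);
            }
          }
        }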

        Uwe Schindler added a comment -

        Here is the policed one.

        In my opinion something is better than nothing. The patents are not violated here, as we only use an abstract API and the string "BOCU-1". You can use the same code to encode in "ISO-8859-1".

        Uwe Schindler added a comment -

        One more violation fixed. Now it's correct!

        Uwe Schindler added a comment -

        Here is a variant with heavy reuse.

        Uwe Schindler added a comment -

        The last one, which can be used with any charset.

        Uwe Schindler added a comment -

        Here is a 100% legally valid implementation:

        • Linking to icu4j-charsets is done dynamically by reflection. If you don't have the ICU4J charsets in your classpath, the attribute throws an explanatory exception.
        • We don't need to ship the rather large JAR file with Lucene just for this class.
        • We don't have legal patent problems, as we neither ship the API nor use it directly.
        • The downside is that the test simply prints a warning but passes, so the class is not tested until you install icu4j-charsets.jar. We can put the JAR file on Hudson, so it can be used during nightly builds, or we download it dynamically during the build.

        I added further improvements to the encoder itself:

        • fewer variables
        • correct error handling for encoding errors
        • removed floating point from the main loop
        Uwe Schindler added a comment -

        A new patch that completely separates the BOCU factory from the implementation (which moves to common/miscellaneous). This has the following advantages:

        • You can use any Charset to encode your terms. The javadocs should only note that the byte[] order must be correct for range queries to work.
        • Theoretically you could remove the BOCU classes entirely; anyone who wants to use them can simply get the Charset from ICU's factory and pass it to the AttributeFactory. The convenience class is still useful, especially if we can later implement the encoding natively without NIO (once the patent issues are solved...).
        • The test for the CustomCharsetTermAttributeFactory uses UTF-8 as the charset and verifies that the created BytesRefs have the same format as a BytesRef created using UnicodeUtil.
        • The test also checks that encoding errors are bubbled up as RuntimeExceptions.

        TODO:

        • docs
        • make handling of encoding errors configurable (replace with the replacement char?)
        • If you want your complete index in e.g. ISO-8859-1, there should also be convenience methods that take CharSequences/char[] in the factory/attribute to quickly convert strings to BytesRefs like UnicodeUtil does; this makes it possible to create TermQueries directly using e.g. the ISO-8859-1 encoding.
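
        As a rough illustration of the "any Charset" idea described above: a reusable encoder that surfaces encoding errors as RuntimeExceptions, as the test behavior described here expects. Class and method names are hypothetical, not those of the patch.

        import java.nio.ByteBuffer;
        import java.nio.CharBuffer;
        import java.nio.charset.CharacterCodingException;
        import java.nio.charset.Charset;
        import java.nio.charset.CharsetEncoder;
        import java.nio.charset.CodingErrorAction;

        class CharsetTermEncoder {
          private final CharsetEncoder encoder;

          CharsetTermEncoder(Charset charset) {
            // REPORT so that unmappable terms fail loudly instead of being replaced silently
            this.encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
          }

          byte[] encode(char[] term, int offset, int length) {
            try {
              ByteBuffer bb = encoder.encode(CharBuffer.wrap(term, offset, length));
              byte[] bytes = new byte[bb.remaining()];
              bb.get(bytes);
              return bytes;
            } catch (CharacterCodingException e) {
              // bubbled up as a RuntimeException, matching the behavior described above
              throw new RuntimeException("Term not encodable as " + encoder.charset(), e);
            }
          }
        }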
        Michael McCandless added a comment -

        This is fabulous! And a great example of what's now possible w/ the cutover to opaque binary terms w/ flex – makes it easy to swap out how terms are encoded.

        BOCU-1 is a much more compact encoding than UTF-8 for non-Latin languages.

        This encoding would also naturally reduce the RAM required for the terms index and Terms/TermsIndex FieldCache (used when you sort by string field) as well, since Lucene just loads the [opaque] term bytes into RAM.

        Robert Muir added a comment -

        You can use any Charset to encode your terms. The javadocs should only note that the byte[] order must be correct for range queries to work.

        I don't think we should add support for any non-Unicode character sets.

        If you want your complete index e.g. in ISO-8859-1

        I am 100% against doing this.

        Michael McCandless added a comment -

        Is there any reason not to make BOCU-1 Lucene's default encoding?

        UTF8 penalizes non-english languages, and BOCU-1 does not, and it sounds like we expect little to no indexing or searching perf penalty (once we have a faster interface to BOCU1, eg our own private impl, like UnicodeUtil).

        Robert Muir added a comment -

        Is there any reason not to make BOCU-1 Lucene's default encoding?

        In my opinion, just IBM. But maybe we can make a strong implementation and they will approve it and give us the patent license:

        http://unicode.org/notes/tn6/#Intellectual_Property

        UTF8 penalizes non-english languages, and BOCU-1 does not, and it sounds like we expect little to no indexing or searching perf penalty (once we have a faster interface to BOCU1, eg our own private impl, like UnicodeUtil).

        I'd like to play with swapping it in as the default, just to see what problems (if any) there are, and to make sure all queries are supported, etc. I can upload a new patch that does it this way and we can play.

        Michael McCandless added a comment -

        > Is there any reason not to make BOCU-1 Lucene's default encoding?

        in my opinion, just IBM

        But... ICU's license is compatible w/ ASL (I think), and includes a working impl of BOCU-1, so aren't we in the clear here? Ie we are free to take that impl, tweak it, add to our sources, and include ICU's license in our LICENSE/NOTICE?

        Robert Muir added a comment -

        But... ICU's license is compatible w/ ASL (I think), and includes a working impl of BOCU-1, so aren't we in the clear here? Ie we are free to take that impl, tweak it, add to our sources, and include ICU's license in our LICENSE/NOTICE?

        I don't know... personally I wouldn't feel comfortable committing something without getting guidance first. But we can explore the technical side with patches on this JIRA issue and not check the box, and I think this is all OK for now.

        Robert Muir added a comment -

        Attached is a really, really rough patch that sets BOCU-1 as the default encoding.

        Beware: it's a work in progress and a lot of the patch is auto-generated (Eclipse), so some things need to be reverted.

        Most tests pass; the idea is to find bugs in tests etc. that abuse BytesRef/assume UTF-8 encoding, things like that.

        Robert Muir added a comment -

        BTW, that patch is huge because I just sucked in the ICU charset stuff to have an implementation that works for testing...

        It's not intended to ever stay that way, as we would just implement the stuff we need without this code, but it makes it easier to test since you don't need any external jars or to muck with the build system at all.

        Robert Muir added a comment -

        Attached is a patch with the start of a "BOCUUtil" with UnicodeUtil-like methods.

        For now I only implemented encode (and encodeWithHash).

        I generated random strings with _TestUtil.randomRealisticUnicodeString and benchmarked, and the numbers are stable.

        encoding   time to encode 20 million strings (ms)   number of encoded bytes
        UTF-8      1,757                                     596,516,000
        BOCU-1     1,968                                     250,202,000

        So I think we get good compression, and good performance close to UTF-8 for encode.
        I'll work on decode now.

        Michael McCandless added a comment -

        Slightly more optimized version of BOCU1 encode (but it's missing the hash variant).

        Michael McCandless added a comment -

        Duh – that was some ancient wrong patch. This one should be right!

        Michael McCandless added a comment -

        Just inlines the 2-byte diff case.

        Michael McCandless added a comment -

        Inlines/unwinds the 3-byte cases. I think we can leave the 4 byte case as a for loop...

        Robert Muir added a comment -

        I ran tests; each of Mike's optimizations speeds up the encode...

        I think I agree with not unrolling the 4-byte case; the "diff" from the previous character has to be > 187659 [0x2dd0b].
        That is like pottery writings and oracle bone script... but the previous ones (the 2-byte and 3-byte cases) speed up CJK and other scripts and are very useful.

        Robert Muir added a comment -

        removed some ifs for the positive unrolled cases.

        Robert Muir added a comment -

        I optimized the surrogate case here, moving it into the 'prev' calculation.
        Now we are faster than UTF-8 on average for encode.

        encoding   time to encode 20 million strings (ms)   number of encoded bytes
        UTF-8      1,756                                     596,516,000
        BOCU-1     1,724                                     250,202,000
        Robert Muir added a comment -

        oops, forgot a check in the surrogate case.

        Robert Muir added a comment -

        Here it is with a first stab at the decoder (it's correct against random ICU strings, but I didn't benchmark yet).

        Yonik Seeley added a comment -

        I took a stab at benchmarking encoding speed only with some different languages.
        I encoded a word at a time (which happens at indexing time).
        I used some text from wikipedia in different languages: english, german, french, spanish, and chinese.
        I used WhitespaceAnalyzer for the first 4 and StandardAnalyzer for the chinese (but analysis speed is not measured.)

        encoding        english   german   french   spanish   chinese
        UTF-8 size      1888      4550     4875     5123      4497
        BOCU-1 size     1888      4610     4995     5249      4497
        BOCU slowdown   29%       39%      47%      61%       80%

        I suspect that the StandardAnalyzer is spitting out individual CJK chars, and hence the same size for BOCU-1 and UTF-8?
        I'll try and see if I can get SmartChineseAnalyzer working and re-run the chinese test.
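
        A sketch of the shape of this comparison: encode each pre-analyzed token once per charset and total the byte counts (again assuming icu4j-charsets provides "BOCU-1"; the helper names are illustrative, not the benchmark code attached here):

        import java.nio.CharBuffer;
        import java.nio.charset.Charset;
        import java.util.List;

        class EncodedSizeComparison {
          // total encoded size of the analyzed tokens for one charset
          static long encodedSize(List<char[]> tokens, Charset cs) {
            long total = 0;
            for (char[] token : tokens) {
              total += cs.encode(CharBuffer.wrap(token)).remaining();
            }
            return total;
          }

          static void compare(List<char[]> tokens) {
            System.out.println("UTF-8  size: " + encodedSize(tokens, Charset.forName("UTF-8")));
            System.out.println("BOCU-1 size: " + encodedSize(tokens, Charset.forName("BOCU-1")));
          }
        }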

        Robert Muir added a comment -

        Yonik, what were you benchmarking? I think you should benchmark overall indexing time, of which encode is just a blip (<1%).

        And yes, since the start state is 0x40, the FIRST CJK char is a diff from 0x40, but any subsequent ones yield savings.

        In general you won't get much compression for Chinese... I'd say max 25%.
        For Russian, Arabic, Hebrew, and Japanese it will do a lot better: max 40%.
        For Indian languages you tend to get about 50%.

        I also don't know how you encoded a word at a time, because I get quite different results. I focused a lot on making 'single-byte diffs' fast (e.g. just a subtraction), and I think I do a lot better for English than the 160% described in http://unicode.org/notes/tn6/

        Furthermore, UTF-8 is a complete no-op for English, so a compression algorithm that is only 29% slower than (byte) char is good in my book, but I don't measure 29% for English.

        I don't think there is any problem with encode speed at all.

        Michael Busch added a comment -

        Yonik can you give more details about how you ran your tests?

        Was it an isolated string encoding test, or does BOCU slow down overall indexing speed by 29%-80% (which would be hard to believe)?

        Yonik Seeley added a comment -

        Yonik can you give more details about how you ran your tests?

        I'm isolating encoding speed only (not analysis, not indexing, etc.) of tokens in different languages.
        So I took some text from Wikipedia, analyzed it to get a list of char[], then encoded each char[] in a loop. Only the last step is benchmarked, to isolate the encode performance. I'm certainly not claiming that indexing is n% slower.

        Robert Muir added a comment -

        Attached is my benchmark for English text.

        UTF-8: 15530 ms
        BOCU-1: 15687 ms

        Note, I use a Sun JVM 1.6.0_19 (64-bit).

        Yonik, if you run this benchmark and find a problem with it, or it's slower on your machine, let me know your configuration, because I don't see the results you do.

        Yonik Seeley added a comment -

        in general you wont get much compression for chinese.. id say max 25%

        Ah, OK.
        I just tried Russian with WhitespaceAnalyzer used to split, and did get good size savings:

        UTF8_size=11056 BOCU-1_size=6810 BOCU-1_slowdown=32%

        Robert Muir added a comment -

        Yonik, please see my issue.

        The fact that we can encode 100 million terms in 15 seconds means any speed concern is meaningless (though I still insist something is wrong: either your benchmark, or it runs slower on your JDK or something, which we should try to improve).

        Michael McCandless added a comment -

        The char[] -> byte[] encode time is a minuscule part of indexing time. And, in turn, indexing time is far less important than the impact on search performance. So... let's focus on the search performance here.

        Most queries are unaffected by the term encoding; it's only AutomatonQuery (= fuzzy, regexp, wildcard) that does a fair amount of decoding...

        Net/net, BOCU-1 sounds like an awesome win over UTF-8.

        Robert Muir added a comment -

        I just insist there is no real difference between this and UTF-8 for encoding english...

        Yonik Seeley added a comment -

        OK, I just tried Robert's Benchmark.java (i.e. fake english word encoding):
        UTF8=15731 BOCU-1=16961 (lowest of 5 diff runs)

        But looking at the benchmark, it looks like the majority of the time could be just making random strings.
        I made a modified Benchmark.java that pulls out this string creation and only tests encoding performance.
        Here are my results:

        UTF8=2936 BOCU-1=4310
        It turns out that making the random strings to encode took up 81% of the UTF8 time.

        System: Win7 64 bit, JVM=Sun 1.6.0_21 64 bit -server

        Robert Muir added a comment -

        That's good news, so we can encode 100 million strings in 4.3 seconds?
        I don't think we need to discuss performance any further; this is a complete non-issue.

        Yonik Seeley added a comment -

        OK, hopefully the right Benchmark.java this time

        Yonik Seeley added a comment -

        Thats good news, so we can encode 100 million strings in 4.3 seconds? I dont think we need to discuss performance any further, this is a complete non-issue.

        Well... hopefully it's not an issue.
        That should really be tested with real indexing when the time comes (micro-benchmarks can do funny things).

        Robert Muir added a comment -

        Well... hopefully it's not an issue.
        That should really be tested with real indexing when the time comes (micro-benchmarks can do funny things).

        It's definitely not an issue; no Lucene indexer can do anything with 100 million strings in any reasonable time where this will matter.

        Instead, most non-Latin languages will be writing fewer bytes, causing less real I/O, using half the RAM at search time, etc., which is way more dramatic.

        UTF-8 is a non-option for our internal memory encoding; I'm suggesting BOCU-1, but if you want to fight me all the way, then I'll start fighting for a reversion back to char[] instead... it's at least less biased.

        Yonik Seeley added a comment -

        Ummm, so you're against actually measuring any indexing performance decrease? That's all I was suggesting.

        Robert Muir added a comment -

        I don't think it's measurable. 100 million strings in 4.3 seconds? This has no effect.

        Keep in mind, I fixed the analysis in 3.1 and doubled the speed of default English indexing in Solr,
        so if you want to improve indexing speed, I think you will be more successful looking at other parts of the code.

        Yonik Seeley added a comment -

        so if you want to improve indexing speed, i think you will be more successful looking at other parts of the code.

        I have only been measuring performance at this point, and I haven't expressed an opinion about what defaults should be used.
        If we convert to BOCU-1 as a default, and if UTF-8 remains an option, then I'd at least want to be able to document any trade-offs and when people should consider setting the encoding back to UTF-8.

        Robert Muir added a comment -

        I have only been measuring performance at this point

        You haven't really been measuring performance, you have just been trying to pick a fight.

        1. Any difference in encode has almost no effect on indexing speed; like I said, 100 million strings in 4.3 seconds?
        2. You aren't factoring I/O or RAM into the equation for the writing systems (of which there are many) where this actually cuts terms to close to half their size.
        3. Since this is a compression algorithm (and I'm still working on it), it's vital to include these things, and not post useless benchmarks about whether it takes 2.9 or 4.3 seconds to encode 100 million strings, which nothing in Lucene can do anything with in any short time anyway.

        I have a benchmark for UTF-8: it's that I have a lot of text that is twice as big on disk, causes twice as much I/O, and eats up twice as much RAM as it should.
        BOCU-1 fixes that, and at the same time keeps ASCII at a single-byte encoding (and other Latin languages are very close).
        So everyone can potentially win.

        Yonik Seeley added a comment -

        You havent really been measuring performance, you have just been trying to pick a fight.

        I'm sorry if it appeared that way, and apologize for anything I said to encourage that perception.

        I was genuinely surprised when you reported "now we are faster than utf-8 on average for encode", so I set out to benchmark it myself and report back. In addition, I wanted to see what the encoding speed diff was for some different languages.

        Robert Muir added a comment -

        I was genuinely surprised when you reported "now we are faster than utf-8 on average for encode", so I set out to benchmark it myself and report back. In addition, I wanted to see what the encoding speed diff was for some different languages.

        For all of Unicode, yes; you just didn't pick a good variety of languages, or didn't tokenize them well (e.g. using an English tokenizer for Chinese).
        I've been measuring against many, and I already checked the bigram (CJKTokenizer) case to make sure that CJK was always smaller (it's not much... e.g. 5 bytes instead of 6, but it's better).

        Robert Muir added a comment -

        By the way, to explain your results on French and German:

        Since the compression is a diff from the 'middle of the alphabet' (Unicode block), an unaccented char, accented char, unaccented char combination will cause two 2-byte diffs.
        In UTF-8 this sequence is 4 bytes, but in BOCU it becomes 5.

        The reason you measured anything noticeable is, I think, because of WhitespaceAnalyzer (which I feel is a tad unrealistic);
        for example, all the German stemmers do something with the umlauts (remove or substitute ue, oe, etc.).

        In general, lots of our analysis for lots of languages folds and normalizes characters in ways like this, which also helps the compression,
        so I think if you used GermanAnalyzer on the German text instead of WhitespaceAnalyzer, you wouldn't see much of a size increase.

        Robert Muir added a comment -

        But looking at the benchmark, it looks like the majority of the time could be just making random strings.
        I made a modified Benchmark.java that pulls out this string creation and only tests encoding performance.
        Here are my results:

        UTF8=2936 BOCU-1=4310

        I think your benchmark isn't very reliable (I got really different results), so I added an extra 0 to do 10x more terms:
        char[][] terms = new char[10000][];

        ret=716132704 UTF-8 encode: 35081
        ret=716132704BOCU-1 encode: 36517

        Like I said before, I don't see a 20% difference.

        Yonik Seeley added a comment -

        I think your benchmark isnt very reliable (i got really different results), so i added an extra 0 to do 10x more terms:

        Did that change the ratio for you? I just tried 10x more terms, and I got the exact same ratio:

        ret=708532704 UTF-8 encode: 30524
        ret=708532704BOCU-1 encode: 44635

        Robert Muir added a comment -

        Yeah, it did (it didn't seem 'stable', but the first run was much different than yours, e.g. 3300 vs 3500 or so).

        I just ran with -server also [using my same 64-bit 1.6.0_19 as before]:
        there is more of a difference, however not as much as yours
        ret=704032704 UTF-8 encode: 32134
        ret=704032704BOCU-1 encode: 36391

        but go figure, if I run with my 32-bit [same JDK: 1.6.0_19], I get horrible numbers!
        Here is -client:
        ret=684832704 UTF-8 encode: 26237
        ret=684832704BOCU-1 encode: 54662

        Here is -server:
        ret=697132704 UTF-8 encode: 30062
        ret=697132704BOCU-1 encode: 46293

        So there is definitely an issue with the 32-bit JVM; are you sure yours is 64-bit?

        Yonik Seeley added a comment -

        Hmmm, interesting. I'm sure my JVM is 64 bit, and I just double-checked that the IDE is using that to launch the benchmark. The differences we see might be down to CPU?

        Here's my 64 bit JVM I'm using:
        java version "1.6.0_21"
        Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
        Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)

        And I just tried a 32 bit one (also -server) I had laying around:
        java version "1.6.0_18"
        Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
        Java HotSpot(TM) Server VM (build 16.0-b13, mixed mode)

        32 bit results:
        ret=713832704 UTF-8 encode: 35895
        ret=713832704BOCU-1 encode: 55855

        Earwin Burrfoot added a comment -

        Returning to this issue, right now the best place for this functionality seems to be a variant of CharTermAttribute?

        Earwin Burrfoot added a comment -

        .. and not the Codec, as was suggested in the beginning.

        DM Smith added a comment -

        Any idea as to when this will be released?

        DM Smith added a comment -

        Would someone be able to champion this? It has appeared ready to go for the last 1.5 years. It looks like it is merely a permission problem. I'd like to see it get into the 3.x series.

        Erick Erickson added a comment -

        2013 Old JIRA cleanup


          People

          • Assignee: Unassigned
          • Reporter: DM Smith
          • Votes: 2
          • Watchers: 4
