LUCENE-1793 (Lucene - Core)

remove custom encoding support in Greek/Russian Analyzers

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      The Greek and Russian analyzers support custom encodings such as KOI-8; they define things like lowercasing and tokenization for these encodings.

      I think that analyzers should support Unicode, and that conversion/handling of other charsets belongs somewhere else.

      I would like to deprecate/remove the support for these other encodings.
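
      For illustration, here is a minimal sketch of where that conversion would live instead (pure JDK, no Lucene APIs; the file name and charset below are only examples): the legacy bytes are decoded at the I/O boundary, so everything downstream already sees Unicode chars.

      import java.io.FileInputStream;
      import java.io.InputStreamReader;
      import java.io.Reader;

      public class DecodeAtIoBoundary {
          public static void main(String[] args) throws Exception {
              // Charset conversion happens here, at the I/O boundary.
              // "doc.koi8" is a hypothetical KOI8-R encoded input file;
              // KOI8-R support depends on the charsets shipped with the JRE.
              Reader reader = new InputStreamReader(
                      new FileInputStream("doc.koi8"), "KOI8-R");

              // Everything downstream (tokenizers, filters, the index) only sees
              // plain Unicode chars, so the analyzer needs no charset support.
              int c;
              while ((c = reader.read()) != -1) {
                  System.out.print((char) c);
              }
              reader.close();
          }
      }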

      1. LUCENE-1793.patch
        12 kB
        Robert Muir
      2. LUCENE-1793.patch
        14 kB
        Robert Muir
      3. LUCENE-1793.patch
        14 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        I am guessing the rationale for the current code is to try to reduce index size? (since these languages are double-byte encoded in Unicode).

        If this is the concern, then I think a better solution would be to integrate some form of Unicode compression (e.g. BOCU-1) into Lucene, rather than trying to deal with legacy character sets in this way.

        Uwe Schindler added a comment -

        I would also strongly suggest removing these custom charsets. They are not Unicode conformant, because they use char codepoint mappings that simply assign a US-ASCII char to some of the input chars. The problems begin with mixed-language texts.
        This strange (and wrong) mapping can also be seen in the tests: they load a KOI-8 file with encoding ISO-8859-1 (to get the native bytes as chars) and then map it. This is very bad!
        The analyzers should really only work on Unicode codepoints and nothing more. For backwards compatibility with old indexes (that are encoded using this strange mapping), we have to preserve the charsets for a while, but deprecate all of them and only leave UTF-16 as input (Java chars).

        You are right: to reduce index size, it would be good to also support other encodings in addition to UTF-8 for storage of term text.
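
        To make that concrete, here is a small sketch (pure JDK, nothing Lucene-specific; the sample word is arbitrary) of what the ISO-8859-1 trick produces: the same KOI8-R bytes decoded as ISO-8859-1 come back as Latin-1 characters in the U+00xx range, not as Cyrillic codepoints.

        import java.io.UnsupportedEncodingException;

        public class Koi8MappingDemo {
            public static void main(String[] args) throws UnsupportedEncodingException {
                // The Russian word "тест" ("test") as raw KOI8-R bytes.
                byte[] koi8Bytes = "тест".getBytes("KOI8-R");

                // What the old tests do: pretend the bytes are ISO-8859-1.
                // The result is Latin-1 mojibake, e.g. "ÔÅÓÔ".
                String asLatin1 = new String(koi8Bytes, "ISO-8859-1");

                // The Unicode-conformant way: decode with the real charset.
                String asUnicode = new String(koi8Bytes, "KOI8-R");

                System.out.println(asLatin1);   // not Cyrillic at all
                System.out.println(asUnicode);  // "тест"
            }
        }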

        Robert Muir added a comment -

        Patch that deprecates the custom charsets.

        After a release, a lot of code can be completely removed and these analyzers will be much simpler.

        I can add more verbiage to these if needed, but I suspect this stuff is not actually working correctly...

        DM Smith added a comment -

        If this is the concern, then I think a better solution would be to integrate some form of Unicode compression (e.g. BOCU-1) into Lucene, rather than trying to deal with legacy character sets in this way.

        So it doesn't get lost, would it be good to open an issue for this? And for alternate encodings?

        Robert Muir added a comment -

        Good idea. Curious what other encodings you had in mind? I only thought of BOCU because it maintains binary sort order, so it makes sense for an index...

        And it looks like there could be a performance benefit over UTF-8 to something like this as well (at least for certain languages): http://unicode.org/notes/tn6/#Performance

        DM Smith added a comment -

        I wasn't thinking about any encoding in particular. It was in reference to Uwe's comment:

        You are right: to reduce index size, it would be good to also support other encodings in addition to UTF-8 for storage of term text.

        I think that having other encodings can be problematic; they reintroduce problems that Unicode solves. But if the idea is worthy of discussion, then a new issue would be a better place to house it.

        Regarding BOCU, it is a patented algorithm requiring a license from IBM for implementation. I gather that it is part of ICU. Not sure if either is a big deal or not.

        Robert Muir added a comment -

        DM, you are right, this is a better discussion for another issue/place.
        I was concerned that we would be taking functionality away, but this is not the case; as Uwe says, it is only "strange".

        I just looked at all these encodings: they all store characters in the extended ASCII range (> 0x7F).
        Therefore, anyone using this strange encoding support is already using 2 bytes per character!
        For example, someone using CP1251 in the Russian analyzer is simply storing Ж as 0xC6, and it is being represented as Æ (2 bytes in UTF-8).
        So, by deprecating these encodings in favor of Unicode, nobody's index size will double...
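
        That arithmetic is easy to verify with a few lines of plain JDK code (the character and charsets are just the example above):

        public class IndexSizeCheck {
            public static void main(String[] args) throws Exception {
                String zhe = "Ж";  // CYRILLIC CAPITAL LETTER ZHE, U+0416

                // One byte (0xC6) in the legacy code page...
                byte[] cp1251 = zhe.getBytes("windows-1251");
                System.out.println(cp1251.length);                      // 1

                // ...which the old charset support effectively treats as Æ (U+00C6):
                String mapped = new String(cp1251, "ISO-8859-1");
                System.out.println(mapped);                             // Æ
                System.out.println(mapped.getBytes("UTF-8").length);    // 2

                // The real Cyrillic character is also 2 bytes in UTF-8,
                // so dropping the legacy encodings does not grow the index.
                System.out.println(zhe.getBytes("UTF-8").length);       // 2
            }
        }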

        Earwin Burrfoot added a comment -

        I am guessing the rationale for the current code is to try to reduce index size? (since these languages are double-byte encoded in Unicode).

        The rationale was most probably to support existing non-Unicode systems/databases/files, whatever. My take is: anyone still holding onto KOI8, CP1251 and friends should silently do harakiri.

        Robert Muir added a comment -

        It seems no one is against this; I will clean this up and add friendly deprecation warnings to the patch.

        Robert Muir added a comment -

        Patch with more javadoc verbiage.

        When do we want to deprecate these strange encodings? It might be too late for 2.9, but I think sooner rather than later would be best.

        Mark Miller added a comment -

        We have not started code freeze yet - I'd deprecate if it makes sense.

        Robert Muir added a comment -

        Updated patch with "removed in next release" changed to "removed in Lucene 3.0".

        The CHANGES entry reads:

        Deprecate the custom encoding support in the Greek and Russian
        Analyzers. If you need to index text in these encodings, please use Java's
        character set conversion facilities (InputStreamReader, etc) during I/O,
        so that Lucene can analyze this text as Unicode instead.
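
        In code, that migration might look roughly like this (a sketch only; the field name and file are illustrative, and it assumes the contrib RussianAnalyzer's no-argument constructor and the Field(String, Reader) constructor):

        import java.io.FileInputStream;
        import java.io.InputStreamReader;
        import java.io.Reader;

        import org.apache.lucene.analysis.ru.RussianAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;

        public class IndexKoi8Document {
            public static void main(String[] args) throws Exception {
                // Do the charset conversion with the JDK, not the analyzer:
                Reader reader = new InputStreamReader(
                        new FileInputStream("russian-doc.koi8"), "KOI8-R");

                Document doc = new Document();
                // The field reads already-decoded Unicode chars.
                doc.add(new Field("contents", reader));

                // A plain RussianAnalyzer (no charset argument) is enough now;
                // pass it to the IndexWriter as usual when adding the document.
                RussianAnalyzer analyzer = new RussianAnalyzer();
                System.out.println("Ready to index with " + analyzer);
            }
        }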

        Robert Muir added a comment -

        Setting to 2.9
        I would like to commit in a day or two if there are no objections.

        Robert Muir added a comment -

        Committed revision 806886.


          People

          • Assignee: Robert Muir
          • Reporter: Robert Muir
          • Votes: 0
          • Watchers: 0
