Lucene - Core
LUCENE-2906

Filter to process output of ICUTokenizer and create overlapping bigrams for CJK

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The ICUTokenizer produces unigrams for CJK. We would like to use the ICUTokenizer but have overlapping bigrams created for CJK, as in the CJK Analyzer. This filter would take the output of the ICUTokenizer, read the ScriptAttribute and, for selected scripts (Han, Kana), produce overlapping bigrams.
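The overlapping-bigram behavior described here can be illustrated with a minimal sketch (plain Java over a string of characters, not the actual Lucene filter, which operates on a TokenStream):

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Produce overlapping bigrams from a run of CJK characters,
    // e.g. "一二三" -> ["一二", "二三"], mirroring CJKAnalyzer's output.
    static List<String> overlappingBigrams(String run) {
        List<String> bigrams = new ArrayList<>();
        for (int i = 0; i + 2 <= run.length(); i++) {
            bigrams.add(run.substring(i, i + 2));
        }
        return bigrams;
    }

    public static void main(String[] args) {
        System.out.println(overlappingBigrams("一二三四")); // [一二, 二三, 三四]
    }
}
```

Note this char-index sketch would mishandle supplementary characters (which occupy two chars); handling them correctly is one of the motivations discussed below.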

      1. LUCENE-2906.patch
        27 kB
        Robert Muir
      2. LUCENE-2906.patch
        49 kB
        Robert Muir
      3. LUCENE-2906.patch
        54 kB
        Robert Muir
      4. LUCENE-2906.patch
        55 kB
        Robert Muir
      There are no Sub-Tasks for this issue.

        Activity

        Robert Muir added a comment -

        I created LUCENE-3669 for the broader interned-type issue

        Robert Muir added a comment -

        Hi Uwe:

        Many filters in Lucene currently do things like this, and have forever (including StandardFilter).
        In my opinion it's OK, as it's documented that this filter works with StandardTokenizer and ICUTokenizer, which use
        the interned types.

        So I would prefer if we discuss this on another issue.

        Uwe Schindler added a comment -

        ...alternatively, we could use a HashSet<String>. If the strings in it are interned, the lookup is fast, too. The hashCode of Strings is precalculated in the String class. For four if checks it may not be really different performance-wise, but that's just another idea. The ctor would simply check the flags and add the type strings to the Set<String>.
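The Set-based alternative Uwe describes might look like this (a sketch only; the constructor flags and the type-string literals are assumptions, not the real constants):

```java
import java.util.HashSet;
import java.util.Set;

public class TypeSetCheck {
    private final Set<String> bigramTypes = new HashSet<>();

    // The ctor checks the flags and adds the corresponding type strings.
    TypeSetCheck(boolean han, boolean hiragana, boolean katakana, boolean hangul) {
        if (han) bigramTypes.add("<IDEOGRAPHIC>");
        if (hiragana) bigramTypes.add("<HIRAGANA>");
        if (katakana) bigramTypes.add("<KATAKANA>");
        if (hangul) bigramTypes.add("<HANGUL>");
    }

    // HashSet.contains() compares cached String hash codes first, so
    // interned type strings are not required for a fast lookup.
    boolean shouldBigram(String type) {
        return bigramTypes.contains(type);
    }
}
```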

        Uwe Schindler added a comment - edited

        Hi Robert,

        I had no time to review before; there is one small thing that should maybe be fixed. Currently this filter relies on the fact that TypeAttribute strings are interned, as it compares by identity:

        String type = typeAtt.type();
        if (type == doHan || type == doHiragana || type == doKatakana || type == doHangul) {
        

        It is documented nowhere that Strings in TypeAttribute need to be interned. We should maybe replace that check by a simple equals(). It seems that you already wanted to do that, as you added a sentinel value Object NO = new Object(). With the above check this sentinel value is useless; a simple null would be enough. EDIT: The sentinel value is also useful for not enabling bigramming if a Tokenizer sets null as TypeAttribute. When using equals() this sentinel makes real sense. The check is not costly: String.equals() already does an identity check for early exit, and if the sentinel is used it will also quickly return false (if String.equals(sentinel) is used, it will return false on the instanceof check; if you call sentinel.equals(String) it will be even faster).

        So I would change this check to:

        String type = typeAtt.type();
        if (doHan.equals(type) || doHiragana.equals(type) || doKatakana.equals(type) || doHangul.equals(type)) {
        

        (This is the fastest check: if the doXXX is the sentinel, its default Object.equals() will return false. If it's a String, String.equals() will return true on identity very quickly, but if it's not interned it will be slower. So we lose nothing, but don't require uselessly interned strings.)
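The sentinel-plus-equals pattern discussed above could be sketched like this (field names and type strings are illustrative, not the filter's real members):

```java
public class SentinelCheck {
    // Sentinel: never equal to any String, so a disabled script can never
    // match, even if a Tokenizer emits null or an arbitrary type string.
    private static final Object NO = new Object();

    private final Object doHan;
    private final Object doHiragana;

    SentinelCheck(boolean han, boolean hiragana) {
        this.doHan = han ? "<IDEOGRAPHIC>" : NO;
        this.doHiragana = hiragana ? "<HIRAGANA>" : NO;
    }

    // equals() on the sentinel (plain Object.equals) is identity-only and
    // returns false fast; String.equals() short-circuits on identity, so
    // interned type strings lose nothing, and non-interned ones still work.
    boolean shouldBigram(String type) {
        return doHan.equals(type) || doHiragana.equals(type);
    }
}
```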

        Robert Muir added a comment -

        Committed. Maybe in the future we enable this for StandardFilter (for good CJK behavior by default), but for now it seems good enough to have separate filters that handle the corner cases and all of Unicode.

        Robert Muir added a comment -

        One new test and a tweak, so that this filter never calls input.incrementToken() after it has already returned false.
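The contract tweak described here (never pulling from the upstream again once it has signalled exhaustion) can be sketched generically in plain Java; this is not the Lucene TokenFilter API, just the same guard pattern over an Iterator:

```java
import java.util.Iterator;

// Wraps an upstream iterator and guarantees the upstream is never
// consulted again once it has reported exhaustion.
public class ExhaustionGuard<T> {
    private final Iterator<T> input;
    private boolean inputExhausted = false;

    public ExhaustionGuard(Iterator<T> input) {
        this.input = input;
    }

    public T nextOrNull() {
        if (inputExhausted) {
            return null; // do not touch the upstream again
        }
        if (!input.hasNext()) {
            inputExhausted = true;
            return null;
        }
        return input.next();
    }
}
```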

        Robert Muir added a comment -

        Patch removing all nocommits, with additional tests.

        I think it's ready to commit.

        Robert Muir added a comment -

        Synced up to trunk... it has a couple of minor nocommits (mostly just some needed tests) that I'll look at tomorrow morning.

        Robert Muir added a comment -

        Sorry to take so long, Tom... I'll round this out tonight.

        Robert Muir added a comment -

        As much as I would like this to work as the patch does (where it's automatic from StandardFilter), I think it's bad because it's something we then have to commit to and deal with for a while (e.g. backwards compat).

        So another idea is just to call it CJKFilter or something under the CJK package for now. We could still cut over CJKAnalyzer like the patch does, and then it finally works with supplementary characters too (which I think is really long needed).

        Tom Burton-West added a comment -

        Any chance this might get implemented for 3.4?

        Robert Muir added a comment -

        bulk move 3.2 -> 3.3

        Robert Muir added a comment -

        The prerequisite subtask is fixed, so we should be able to add this in 3.2 (supporting StandardTokenizer, UAX29URLEmailTokenizer, and ICUTokenizer) without having to change any of the tokenizers.

        I'll update the patch.

        Robert Muir added a comment -

        How will this differ from the SmartChineseAnalyzer?

        The SmartChineseAnalyzer is for Simplified Chinese only... this is about the
        language-independent technique similar to what CJKAnalyzer does today.

        I doubt it but can this be in 3.1?

        Well, I hate the way CJKAnalyzer treats things like supplementary characters (wrongly).
        This is definitely a bug, and it is fixed here. Part of me wants to fix this as quickly as possible.

        At the same time though, I would prefer 3.2... otherwise I would feel like I am rushing things.

        I don't think 3.2 needs to come a year after 3.1... in fact, since we have a stable branch, I think it's
        stupid to make bugfix releases like 3.1.1 when we could just push out a new minor version (3.2) with
        bugfixes instead. The whole branch is intended to be stable changes, so I think this is a better use
        of our time. But this is just my opinion; we can discuss it later on the list as one idea to promote
        more rapid releases.

        DM Smith added a comment -

        Two questions:
        How will this differ from the SmartChineseAnalyzer?
        I doubt it but can this be in 3.1?

        Robert Muir added a comment -

        Here's a patch going in a slightly different direction (though we can still add some special ICU-only stuff here).

        Instead, the patch synchronizes the token types of ICUTokenizer with StandardTokenizer, adds the necessary types to both, and then adds the bigramming logic to StandardFilter.

        This way, CJK works easily "out of the box" for all of Unicode (e.g. supplementaries) and plays well with other languages. I deprecated CJKTokenizer in the patch and pulled out its special full-width filter into a separate TokenFilter.

        Tom Burton-West added a comment -

        Sounds good to me.

        The option to limit to "joined" text also sounds very useful.

        Tom

        Robert Muir added a comment -

        I'll take it; I've done the unibigram approach already (maybe we can just have it as a separate filter option), so the bigram should be easy.

        My original design just lets you provide a BitSet of script codes. (This would be simple, I think, to parse from, say, a Solr factory.)

        I think it's also useful to have an option for whether the filter should only do this for "joined" text or not (based on offsets). For CJK I think it makes sense to enforce this, so that it won't bigram across sentence boundaries. But for, say, the Tibetan language, where you have a syllable separator, you would want to turn this off.

        Separately, if you want it to work "just like" CJKTokenizer, please be aware that by default, the Unicode standard tokenizes Katakana to words (only Hiragana and Han are tokenized to codepoints). So in this case you would have to use a custom ruleset if you wanted Katakana to be tokenized to codepoints instead of words, for later bigramming. I'm not sure you want to do this, though... (In truth, CJKTokenizer bigrams ANYTHING out of ASCII, including a lot of things it shouldn't.)

        For Hangul the same warning applies, but it's more debatable: you might want to do this if you don't have a decompounder... but in my opinion this is past tokenization, and it's the same problem you have with German, etc... the default tokenization is not "wrong".

        In either case, if you decide to do that, it would be a pretty simple ruleset!

        Let me know if this makes sense to you.
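The BitSet-of-script-codes design described in this comment might be sketched like this, using java.lang.Character.UnicodeScript ordinals in place of ICU's UScript constants (an assumption for illustration; the real filter would read ICU script codes from the ScriptAttribute):

```java
import java.util.BitSet;

public class ScriptFilterSketch {
    // One bit per script code; set bits mark scripts to bigram.
    private final BitSet bigramScripts = new BitSet();

    ScriptFilterSketch(Character.UnicodeScript... scripts) {
        for (Character.UnicodeScript s : scripts) {
            bigramScripts.set(s.ordinal());
        }
    }

    // Would the filter bigram a token starting with codepoint cp?
    boolean shouldBigram(int cp) {
        return bigramScripts.get(Character.UnicodeScript.of(cp).ordinal());
    }
}
```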


          People

          • Assignee:
            Robert Muir
            Reporter:
            Tom Burton-West
          • Votes:
            0
            Watchers:
            2
