Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The standard analyzer in Lucene is not exactly Unicode-friendly with regard to breaking text into words, especially for non-alphabetic scripts. This is because it is unaware of the Unicode word-boundary properties.

      I actually couldn't figure out how the Thai analyzer could possibly be working until I looked at the JFlex rules and saw that the codepoint range for most of the Thai block had been added to the alphanum specification. Defining exact codepoint ranges like this for every language could help with the problem, but you'd basically be reimplementing the boundary properties already stated in the Unicode standard.

      In general this kind of behavior is bad in Lucene even for Latin: for instance, the analyzer will break words around accent marks in decomposed form. While most Latin letter + accent combinations have composed forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, I suppose.)
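
      To see the decomposed-accent problem concretely: a boundary-aware word iterator keeps a base letter and its combining mark together, because the combining mark carries Word_Break=Extend. A minimal sketch using the modern JDK's java.text.BreakIterator (standing in here for ICU's iterator; the class and method names below are invented for this illustration and are not part of Lucene or the patch):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class UnicodeWordDemo {
    // Split text into word tokens using the JDK's UAX#29-style word iterator.
    // Only segments containing at least one letter or digit are kept as tokens.
    public static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance();
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "cafe" + U+0301 (combining acute): the combining mark has
        // Word_Break=Extend, so the decomposed word stays a single token
        // instead of being broken around the accent.
        System.out.println(words("cafe\u0301 latte"));
    }
}
```

      A character-range tokenizer, by contrast, would split at the combining mark unless that codepoint were enumerated by hand.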

      I've got a partially tested StandardAnalyzer that uses ICU's rule-based BreakIterator instead of JFlex. Using this method you can define word boundaries according to the Unicode boundary properties. After getting it into some good shape I'd be happy to contribute it for contrib, but I wonder if there's a better solution so that out of the box Lucene will be more friendly to non-ASCII text. Unfortunately it seems JFlex does not support the use of these properties, such as [\p{Word_Break = Extend}], so this is probably the major barrier.

      Thanks,
      Robert

      1. LUCENE-1488.txt
        118 kB
        Robert Muir
      2. LUCENE-1488.txt
        134 kB
        Robert Muir
      3. LUCENE-1488.patch
        78 kB
        Robert Muir
      4. LUCENE-1488.patch
        147 kB
        Robert Muir
      5. LUCENE-1488.patch
        172 kB
        Robert Muir
      6. LUCENE-1488.patch
        175 kB
        Robert Muir
      7. ICUAnalyzer.patch
        33 kB
        Robert Muir

        Issue Links

          Activity

          Grant Ingersoll added a comment -

           Very interesting, Robert. I'd like to see your patch. I don't think we need to think of it as a StandardAnalyzer replacement, but I could totally see offering it as the ICUAnalyzer or some other, better name. In other words, I'd approach this as another Analyzer in the arsenal of Analyzers; otherwise, we'll have to deal with back-compatibility issues, etc.

          Robert Muir added a comment -

           That's a good idea. Currently, trying to get it to pass all the StandardAnalyzer unit tests causes some problems, since Lucene has some rather obscure definitions of 'number' (I think IP addresses, etc. are included) which differ dramatically from the basic Unicode definition.

          Other things of note:

           Instantiating the analyzer takes a long time (a couple of seconds) because ICU must "compile" the rules. I'm not sure of the specifics, but by compile I think that means building a massive FSM or similar based on all the Unicode data. It's possible to precompile the rules into a binary format, but I think this is not currently exposed in ICU.

           The Lucene tokenization pipeline makes the implementation a little hairy. I hack around it by tokenizing on whitespace first, then acting as a token filter (just like the Thai analyzer does, which also uses RBBI). I don't think this is really that bad from a linguistic standpoint, because the rare cases where a 'token' can have whitespace inside it (Persian, etc.) need serious muscle somewhere else and should be handled by a language analyzer.
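
           The whitespace-first approach described above can be sketched in two stages, with the JDK's java.text.BreakIterator standing in for ICU's RBBI (the class name and structure here are illustrative, not the patch's actual code):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class WhitespaceThenBreak {
    // Stage 1: split on whitespace (roughly what WhitespaceTokenizer does).
    // Stage 2: run a word BreakIterator over each chunk, acting like a
    // token filter that breaks the coarse tokens down further.
    public static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance();
        for (String chunk : text.split("\\s+")) {
            if (chunk.isEmpty()) continue;
            words.setText(chunk);
            int start = words.first();
            for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
                String piece = chunk.substring(start, end);
                // Keep only segments with real word content, skipping punctuation.
                if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    out.add(piece);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo-bar baz"));
    }
}
```

           Since no word boundary ever falls inside a run of non-whitespace that a break iterator would join across whitespace, the two-stage pipeline produces the same tokens as running the iterator over the whole text, which is why the hack is linguistically harmless for most scripts.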

           I'll try to get this thing into reasonable shape, at least to document the approach.

          Robert Muir added a comment -

           I've attached a patch for 'ICUAnalyzer'. I see that some things involving Token have changed, but I created it before that point.

           I stole the unit tests from StandardAnalyzer, added comments as to why certain ones aren't appropriate, and disabled those.

           I added some unit tests that demonstrate some of the value: correct analysis for Arabic numerals, Hindi text, decomposed Latin diacritics, Hebrew punctuation, and Cantonese and Linear B text outside the BMP.

           One issue is that setMaxTokenLength() doesn't work correctly for values > 255, because CharTokenizer has a hardcoded private limit of 255 that I can't override. This is a problem since I use WhitespaceTokenizer first and then break down those tokens with the RBBI.

          Robert Muir added a comment -

           As soon as I figure out how to invoke the ICU RBBI compiler, I'll see if I can update the patch with compiled rules so instantiation of this thing is cheap.

          uday kumar maddigatla added a comment -

           Hi,

           I am facing the same problem: my document contains English as well as Danish elements.

           I tried to use this analyzer, and when I do I get this error:

          Exception in thread "main" java.lang.ExceptionInInitializerError
          at org.apache.lucene.analysis.icu.ICUAnalyzer.tokenStream(ICUAnalyzer.java:74)
          at org.apache.lucene.analysis.Analyzer.reusableTokenStream(Analyzer.java:48)
          at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:117)
          at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
          at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
          at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:765)
          at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:743)
          at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1918)
          at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1895)
          at com.IndexFiles.indexDocs(IndexFiles.java:87)
          at com.IndexFiles.indexDocs(IndexFiles.java:80)
          at com.IndexFiles.main(IndexFiles.java:57)
          Caused by: java.lang.IllegalArgumentException: Error 66063 at line 2 column 17
          at com.ibm.icu.text.RBBIRuleScanner.error(RBBIRuleScanner.java:505)
          at com.ibm.icu.text.RBBIRuleScanner.scanSet(RBBIRuleScanner.java:1047)
          at com.ibm.icu.text.RBBIRuleScanner.doParseActions(RBBIRuleScanner.java:484)
          at com.ibm.icu.text.RBBIRuleScanner.parse(RBBIRuleScanner.java:912)
          at com.ibm.icu.text.RBBIRuleBuilder.compileRules(RBBIRuleBuilder.java:298)
          at com.ibm.icu.text.RuleBasedBreakIterator.compileRules(RuleBasedBreakIterator.java:316)
          at com.ibm.icu.text.RuleBasedBreakIterator.<init>(RuleBasedBreakIterator.java:71)
          at org.apache.lucene.analysis.icu.ICUBreakIterator.<init>(ICUBreakIterator.java:53)
          at org.apache.lucene.analysis.icu.ICUBreakIterator.<init>(ICUBreakIterator.java:45)
          at org.apache.lucene.analysis.icu.ICUTokenizer.<clinit>(ICUTokenizer.java:58)
          ... 12 more

           Please help me with this.

          Robert Muir added a comment -

           What version of ICU4J are you using? It needs to be >= 4.0.

          Robert Muir added a comment -

           Updated patch; not ready yet, but you can see where I am going.

           ICUTokenizer: breaks text into words according to UAX #29: Unicode Text Segmentation. Text is divided across script boundaries so that segmentation can be tailored for different writing systems; for example, Thai text is segmented with a different method. The default and script-specific rules can be tailored; in the resources folder I have some examples for Southeast Asian scripts, etc. Since I need script boundaries for tailoring, I stuff the ISO 15924 script code constant into the flags; this could be useful for downstream consumers.

           ICUCaseFoldingFilter: folds case according to Unicode Default Caseless Matching (full case folding). This may change the length of the token; for example, the German sharp s is folded to 'ss'. This filter interacts with the downstream normalization filter in a special way, so you can provide a hint as to what the desired normalization form will be. In the NFKC or NFKD case it will apply the NFKC_Closure set so you do not have to Normalize(Fold(Normalize(Fold))).
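
           The length-changing behavior of full case folding can be seen even without ICU. The JDK does not expose full case folding directly (ICU4J's UCharacter.foldCase does), but the classic sharp-s example can be reproduced with an upper-then-lower round trip; this is a rough stdlib approximation for illustration only, not what the filter actually calls, and it is not identical to ICU folding for every code point:

```java
import java.util.Locale;

public class CaseFoldSketch {
    // Rough approximation of full case folding using only the JDK:
    // upper-case then lower-case with the root locale. This reproduces
    // the length-changing sharp-s behavior (ß -> SS -> ss).
    public static String fold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The 6-character input becomes 7 characters after folding.
        System.out.println(fold("Stra\u00DFe")); // prints "strasse"
    }
}
```

           A plain LowercaseFilter would leave ß unchanged, which is exactly why full case folding matches more caseless-equal pairs.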

           ICUDigitFoldingFilter: standardizes digits from different scripts to the Latin values 0-9.
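
           The digit-folding idea can be sketched with the JDK alone, since Character.digit understands every Unicode decimal digit; this is an illustration of the concept, not the filter's implementation:

```java
public class DigitFoldSketch {
    // Replace any Unicode decimal digit with its ASCII 0-9 equivalent,
    // leaving all other code points untouched.
    public static String foldDigits(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (Character.isDigit(cp)) {
                // Character.digit returns the numeric value 0-9 for any
                // decimal digit, regardless of script.
                sb.append((char) ('0' + Character.digit(cp, 10)));
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Arabic-Indic digits U+0661..U+0663 fold to "123".
        System.out.println(foldDigits("\u0661\u0662\u0663"));
    }
}
```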

           ICUFormatFilter: removes identifier-ignorable codepoints, specifically those from the Format category.

           ICUNormalizationFilter: applies Unicode normalization to text. This is accelerated with a quick check.
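
           Unicode normalization itself is also available in the JDK via java.text.Normalizer, so what such a filter does to decomposed input can be shown in a few lines (a minimal sketch, not the filter's code):

```java
import java.text.Normalizer;

public class NormalizeSketch {
    // Canonical composition: NFC merges a base letter and its combining
    // mark into a single precomposed code point where one exists.
    public static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "cafe\u0301"; // 'e' followed by combining acute, 5 chars
        String composed = nfc(decomposed);
        // e + U+0301 compose to the single code point U+00E9 ('é').
        System.out.println(composed.length());
        System.out.println(composed.equals("caf\u00E9"));
    }
}
```

           Normalizing both indexed and queried text this way is what makes the decomposed and precomposed spellings of the same word match.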

          ICUAnalyzer ties all this together. All of these components should also work correctly with surrogate-pair data.

           Needs more docs and tests. Any comments appreciated.

          Robert Muir added a comment -

           Here's a simple description of what the current functionality buys you:

           All Indic languages (Hindi, Bengali, Tamil, ...) and Middle Eastern languages (Arabic, Hebrew, etc.) will work pretty well here (by that I mean tokenized, normalized, etc.). Most of these Lucene cannot tokenize correctly with any of the built-in analyzers.

           Obviously Lucene already handles European languages quite well, but Unicode still brings some improvements here, e.g. better case folding.

          And finally, of course, the situation where you have data in a bunch of these different languages!

           In general, the Unicode defaults work quite well for almost all languages, with the exception of CJK and Southeast Asian languages. It's not my intent to really solve those harder cases, only to provide a mechanism for someone else to deal with them if they don't like the defaults.

           A great example is the Arabic tokenizer: it should not exist, because the Unicode defaults work great for that language. And it would be silly to contemplate a HindiTokenizer, BengaliTokenizer, etc. when the Unicode defaults will tokenize those correctly as well.

           There's still some annoying complexity here, and any comments are appreciated. Especially tricky is the complexity/performance/maintenance balance; e.g. the case-folding filter could be a lot faster, but then it would have to be updated whenever a new Unicode version is released. Another thing is that I didn't optimize the BMP case anywhere (i.e. I work with 32-bit codepoints to ensure surrogate data works), and I think that's worth considering, since something like 99.9% of data is in the BMP.

          Thanks,
          Robert

          Robert Muir added a comment -

           Just an update; still more work to be done.

          some of the components are javadoc'ed and have pretty good tests (case folding and normalization). These might be useful to someone in the meantime.

           Also added some tests to TestICUAnalyzer for various JIRA issues (LUCENE-1032, LUCENE-1215, LUCENE-1343, LUCENE-1545, etc.) that are solved here.

          Michael McCandless added a comment -

          ICUAnalyzer looks very useful! Good work Robert. (And, thanks!).

          Do you think this'll be ready to go in time for 2.9 (which we are
          trying to wrap up soonish)?

           It seems like this absorbs the functionality of many of Lucene's
           current analyzers. E.g. you mentioned ArabicAnalyzer already. What
           other analyzers (e.g. in contrib/analyzers/*) would you say are
           logically subsumed by this?

          Also, this seems quite different from StandardAnalyzer, in that it
          focuses entirely on doing "good" tokenization, by relying on the
          Unicode standard (defaults) instead of fixed char ranges in
          StandardAnalyzer. So it fixes many bugs in how StandardAnalyzer
          tokenizes, especially on non-European languages.

           Also, StandardAnalyzer goes beyond making the initial tokens: it also
           tries to label things as acronym, host name, number, etc., and to
           filter out stop words.

          I assume ICUCaseFoldingFilter logically subsumes LowercaseFilter?

          Especially tricky is the complexity-performance-maintenance balance, i.e. the case-folding filter could be a lot faster, but then it would have to be updated when a new unicode version is released.

          I think it's fine to worry about this later. Correctness is more
          important than performance at this point.

          Robert Muir added a comment -

           Michael, I don't think it will be ready for 2.9. Here are some answers to your questions.

           Going with your Arabic example:
           the only thing this absorbs is language-specific tokenization (like ArabicLetterTokenizer), because, as mentioned, I think that's generally the wrong approach.
           But this can't replace ArabicAnalyzer completely, because ArabicAnalyzer stems Arabic text in a language-specific way, which has a huge effect on retrieval quality for Arabic-language text.

           Some of what it does, though, the language-specific analyzers don't do.

           In this specific example, it would be nice if ArabicAnalyzer actually used the functionality here, then did its Arabic-specific stuff!
           This functionality will do things like normalize 'Arabic Presentation Forms' and deal with Arabic digits, things that aren't in the ArabicAnalyzer. It will also treat any non-Arabic text in your corpus very nicely!

           Yes, you are correct about the difference from StandardAnalyzer, and I would argue there are tokenization bugs in how StandardAnalyzer handles European languages too; just see LUCENE-1545!

           I know StandardAnalyzer does these things. This tokenizer has some built-in types already, such as number. If you want to add more types, it's easy: just make a .txt file with your grammar, create a RuleBasedBreakIterator with it, and pass it along to the tokenizer constructor. You will have to subclass the tokenizer's getType() for any new types, though, because RBBI 'types' are really just integer codes in the rule file, and you have to map them to some text such as "WORD".

           Yes, case folding will work better than lowercasing for a few European languages.

          Earwin Burrfoot added a comment -

          But this can't replace ArabicAnalyzer completely, because ArabicAnalyzer stems arabic text in a language-specific way, which has a huge effect on retrieval quality for Arabic language text.

          What about separating word-tokenizing from morphological processing?

          Robert Muir added a comment -

          Earwin, I don't understand your question...
          There is no morphological processing or any other language-specific functionality in this patch...

          Robert Muir added a comment -

           Add analysis tests for a few languages to demonstrate what this does.

          Earwin Burrfoot added a comment -

          There is no morphological processing or any other language-specific functionality in this patch...

           I'm speaking of the stemming in ArabicAnalyzer. Why can't you use its stemming TokenFilter on top of all the ICU goodness from this patch? Everything else ArabicAnalyzer consists of might as well be deleted right after.

          Robert Muir added a comment -

          Earwin, you are absolutely correct.

           Though I would also want to keep the ArabicNormalizationFilter, as it does "non-standard" normalization that is usually helpful for Arabic text.

          Robert Muir added a comment - edited

           This is the latest copy of my code (in response to a java-user discussion).

           Not many changes, except TokenStream changes and work for writing systems with no word separation: Lao, Myanmar, CJK, etc.
           For these, the tokenizer does not break text into words but into subwords (syllables), and unigrams & bigrams of these are indexed.
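
           The unigram-plus-bigram indexing of subword units can be sketched as follows. This is a toy illustration of the idea with made-up syllable input; the real bigram filter emits tokens with position increments in a TokenStream rather than building a flat list:

```java
import java.util.ArrayList;
import java.util.List;

public class SubwordBigrams {
    // Given subword units (e.g. syllables from the segmenter), emit every
    // unigram plus every adjacent-pair bigram, the way unsegmented scripts
    // are indexed per the description above.
    public static List<String> unigramsAndBigrams(List<String> syllables) {
        List<String> out = new ArrayList<>(syllables); // unigrams first
        for (int i = 0; i + 1 < syllables.size(); i++) {
            out.add(syllables.get(i) + syllables.get(i + 1)); // adjacent bigrams
        }
        return out;
    }

    public static void main(String[] args) {
        // Three syllables yield three unigrams and two bigrams.
        System.out.println(unigramsAndBigrams(List.of("ab", "cd", "ef")));
    }
}
```

           Indexing both granularities lets queries match even when the query's word segmentation differs from the indexer's guess.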

          Robert Muir added a comment -

           Here I complete Lao support (fully implementing http://www.panl10n.net/english/final%20reports/pdf%20files/Laos/LAO06.pdf).

           Also fixed a TokenStream bug (not a back-compat issue!) in the bigram filter.

           I think all the language/Unicode features are done. Basically, we can get better language support in the future from ICU automatically, but I think all languages are handled in a reasonable way for now.

           IMHO, all that is left is to:

           • fix docs, improve tests, the Java API, the RBBI grammars, any bugs, and TODOs
           • decide if we want to merge this with the collation contrib (I think it might be a good idea)
           • test various versions of ICU to know which ones it works with

           It works and the tests pass, but some tests are slow (10+ seconds, though I made them faster).
           The thing is, these slow tests have found bugs and will help test version compatibility, so I like them.

          Uwe Schindler added a comment -

           Hi Robert: if you do a restoreState(), no clearAttributes() is needed beforehand, as restoreState overwrites all attributes. Everything else looks good.

          Robert Muir added a comment -

          Uwe, thanks for taking a look! I'll fix this.

          Robert Muir added a comment -

          Setting a fix version and correcting the description of the issue.

          DM Smith added a comment -

          Robert, just finished reviewing the code. Looks great! Doesn't look like there's too much left. All I see is a bit of JavaDoc and an extraneous unused variable (ICUTokenizer: private PositionIncrementAttribute posIncAtt).

          The documentation in ICUNormalizationFilter is very instructive. Kudos. The only part that's hard for me to understand is the filter order dependency, but then again that's a hard topic in the first place.

          I'm wondering whether it would make sense to have multiple representations of a token at the same position in the index. Specifically, transliterations and case-folding. That is, the one is a "synonym" for the other. Is that possible, and does it make sense? I'm imagining a use case where an end user enters a Latin-script transliteration of Greek, "uios", as a search request but might also enter "υιος".

          The other question on my mind is that given a text of German, Greek and Hebrew (three distinct scripts) does it make sense to apply stop words to them based on script? And should stop words be normalized on load with the ICUNormalizationFilter? Or is it a given that they work as is?

          Can/How does all this integrate with stemmers?

          Again, many thanks! (Btw, special thanks for this working with 2.9 and Java 1.4!)

          Robert Muir added a comment -

          DM, I really appreciate your review. You have brought up some good ideas that I haven't yet thought about.

          All I see is a bit of JavaDoc and an extraneous unused variable (ICUTokenizer: private PositionIncrementAttribute posIncAtt

          Yeah, there are some TODOs, plus cleanup on the tokenstreams and the API in general. It's not yet easy to customize the way it's supposed to be: you as a user should be able to supply BreakIterator impls to the tokenizer and say "use these rules/dictionary/whatever for tokenizing XYZ script only".

          I'm wondering whether it would make sense to have multiple representations of a token with the same position in the index. Specifically, transliterations and case-folding. That is, the one is a "synonym" for the other. Is that possible and does it make sense? I'm imagining a use case where a end user enters for a search request a Latin script transliteration of Greek "uios" but might also enter "υιος".

          Yeah, this is something to consider. I don't think it makes sense for the case folding filter, but maybe for the transform filter? I'll have to think about it.
          There are use cases like the one you mentioned, and also real-world ones like invoking a Serbian-Latin transform, where you want users to be able to search in either writing system and there actually is a clearly defined transformation.

          I guess on the other hand, you could always use a separate field (with different analysis/transforms) for each and search both.
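          The "synonym at the same position" idea above boils down to emitting the transliterated form with a position increment of zero. A JDK-only sketch of that mechanic (the Tok class and the toy υιος/uios transliterator are hypothetical illustrations, not Lucene or ICU API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class SynonymSketch {
    // A token with its position increment: 0 means "same position as the previous token".
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    // Emit each original term, followed by its transliterated form stacked at
    // the same position, the way a synonym-style TokenFilter would.
    static List<Tok> withTransliteration(List<String> terms, UnaryOperator<String> translit) {
        List<Tok> out = new ArrayList<>();
        for (String t : terms) {
            out.add(new Tok(t, 1));       // original token advances the position
            String alt = translit.apply(t);
            if (!alt.equals(t)) {
                out.add(new Tok(alt, 0)); // transliteration at the same position
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // toy transliterator: Greek "υιος" to Latin "uios" (hypothetical mapping)
        for (Tok t : withTransliteration(List.of("υιος"), s -> s.equals("υιος") ? "uios" : s)) {
            System.out.println(t.term + " posInc=" + t.posInc);
        }
    }
}
```

          Searching for either form then hits the same position, which is exactly the "one is a synonym for the other" behavior DM describes.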

          The other question on my mind is that given a text of German, Greek and Hebrew (three distinct scripts) does it make sense to apply stop words to them based on script? And should stop words be normalized on load with the ICUNormalizationFilter? Or is it a given that they work as is?

          You could put them all in one list with the regular StopFilter now. They won't clash, since they are different Unicode strings. Obviously I would normalize this list with the same stuff (normalization form/case folding/whatever) that your analyzer uses.

          I don't put any stopwords in this, because that's language-dependent; I'm trying to stick with language-independent behavior (either things that apply to Unicode as a whole, or to specific writing systems, which can be accurately detected).
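          The mixed-script stopword suggestion can be sketched with JDK classes alone; here NFKC plus lowercasing stand in for whatever normalization and folding the real analysis chain uses, and the stopwords shown are arbitrary examples:

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopwordSketch {
    // Normalize the same way the analysis chain would; NFKC + lowercase here
    // is only a stand-in for the analyzer's actual normalization/folding.
    static String norm(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC).toLowerCase(Locale.ROOT);
    }

    // One combined stop set: German, Greek, and Hebrew entries don't clash,
    // since they are simply different Unicode strings.
    static Set<String> stopSet(String... words) {
        Set<String> set = new HashSet<>();
        for (String w : words) set.add(norm(w));
        return set;
    }

    // A token is a stopword only if it matches after the same normalization.
    static boolean isStop(Set<String> stops, String token) {
        return stops.contains(norm(token));
    }
}
```

          The key point from the comment above is that the list must be normalized on load with the same transformations the analyzer applies, or tokens will silently fail to match.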

          Can/How does all this integrate with stemmers?

          Right, this is just supposed to be what "StandardTokenizer"-type stuff does, and you would add stemming on top of it. The idea is that you would use this even if you think you only have English text, maybe then applying your Porter English stemmer. But if it happens to stumble upon some CJK or Thai or something along the way, everything will be OK.

          In all honesty, I probably put 90% of the work into the Khmer, Myanmar, Lao, etc. cases. Good tokenization, I think, is what makes a usable search engine; for a lot of languages stemming is only a bonus.

          However, one thing it also does is put the script value in the flags for each token. This can work pretty well: if it's Greek script, it's probably the Greek language, but if it's Hebrew script, well, it could be Yiddish too. If it's Latin script, it could be English, German, etc. It's intended only to make life easier, since the information is already available... but I don't know yet how to make use of it in a nice way.
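          The per-token script value can be approximated with java.lang.Character.UnicodeScript; the patch itself stores an ICU script constant in the token flags, so this JDK-only sketch only illustrates the kind of detection involved, not the actual code:

```java
public class ScriptSketch {
    // Return the script of a token: the first codepoint whose script is not
    // COMMON or INHERITED decides (punctuation and combining marks are skipped).
    static Character.UnicodeScript scriptOf(String token) {
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            Character.UnicodeScript s = Character.UnicodeScript.of(cp);
            if (s != Character.UnicodeScript.COMMON && s != Character.UnicodeScript.INHERITED) {
                return s;
            }
            i += Character.charCount(cp);
        }
        return Character.UnicodeScript.COMMON;
    }

    public static void main(String[] args) {
        System.out.println(scriptOf("υιος"));  // Greek-script token
        System.out.println(scriptOf("שלום"));  // Hebrew-script token
        System.out.println(scriptOf("test"));  // Latin-script token
    }
}
```

          As the comment notes, script narrows the language (Greek script is almost certainly Greek) but does not determine it (Hebrew script could be Hebrew or Yiddish).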

          Again, many thanks! (Btw, special thanks for this working with 2.9 and Java 1.4!)

          Yeah, I haven't updated it to Java 5/Lucene 3.x yet; I started working on it but kinda forgot about that so far. I guess this is a good thing, so you can play with it if you want.

          Robert Muir added a comment -

          Linking this issue to LUCENE-2124. Once contrib/collation has been renamed contrib/icu, I want to split two of the tokenfilters (case folding and normalization) out as a separate, smaller issue to start.

          These are useful on their own, yet inseparable because of the special K mappings: you must tell the case folding filter what your targeted normalization form will be.
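          A rough JDK-only sketch of why the two filters interact: compatibility (K) mappings and case mappings can each take a string out of the target normalization form, so the fold has to be sandwiched with the normalizer. Here toLowerCase(Locale.ROOT) merely approximates ICU's full case folding, which the JDK does not expose:

```java
import java.text.Normalizer;
import java.util.Locale;

public class FoldSketch {
    // NFKC, then fold, then NFKC again: folding can produce strings that are
    // no longer in the target form, which is why the case folding filter must
    // be told which normalization form the chain is targeting.
    static String nfkcFold(String s) {
        String n = Normalizer.normalize(s, Normalizer.Form.NFKC);
        return Normalizer.normalize(n.toLowerCase(Locale.ROOT), Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(nfkcFold("ﬁ"));  // fi-ligature U+FB01 becomes "fi"
        System.out.println(nfkcFold("ſ"));  // long s U+017F becomes "s"
        System.out.println(nfkcFold("Ⅻ")); // Roman numeral U+216B becomes "xii"
    }
}
```

          Compatibility characters like the fi-ligature or Roman-numeral codepoints only fold to a searchable ASCII form when both steps run in the right order, which matches the "you must tell the case folding filter your targeted normalization form" constraint above.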

          Vilaythong Southavilay added a comment -

          I am developing an IR system for Lao. I've been searching for this kind of analyzers to be used in my development to index documents containing languages like Lao, French and English in one single passage.

          I tested it for Lao with Lucene 2.9 and 3.0 using my short passage. It worked correctly for both versions as I expected, especially for segmenting Lao single syllables. I also tried it with the bigram filter option for two syllables, which worked fine for simple words. The result contained some two-syllable words which do not make sense in the Lao language. I guess this is not a big issue. As Robert pointed out (in an email to me), we still need dictionary-based word segmentation for Lao, which can be integrated in ICU and used by this analyzer.

          Anyway, thanks for your assistance. This work will be helpful not only for Lao but for other languages as well, because it's good to have a common analyzer for Unicode text.

          I'll continue testing it and report any problems if I find one.

          Robert Muir added a comment -

          Thanks for sharing those results! Yes the bigram behavior (right now enabled for Han, Lao, Khmer, and Myanmar) is an attempt to boost relevance in a consistent way since we do not have dictionary-based word segmentation for those writing systems, only the ability to segment into syllables.

          In the next patch I'll make it easier to configure this behavior, and turn it off when you want, without writing your own analyzer.

          I am glad to hear the syllable segmentation algorithm is working well!
          The credit really belongs to the PAN Localization Project; I simply implemented the algorithm described here: http://www.panl10n.net/english/final%20reports/pdf%20files/Laos/LAO06.pdf
          You can see the code in Lao.rbbi in the patch. Warning: as it mentions, I am pretty sure Lao numeric digits are not yet working correctly, but hopefully I will fix those too in the next version.
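          The bigram behavior described above (for Han, Lao, Khmer, and Myanmar) can be illustrated with a toy sketch, not the actual filter: once the break iterator has produced syllables or ideographs, adjacent pairs are joined into overlapping bigram terms:

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Join adjacent syllables/ideographs into overlapping bigrams; a run of
    // length one is kept as a single token so nothing is lost.
    static List<String> bigrams(List<String> syllables) {
        if (syllables.size() < 2) {
            return new ArrayList<>(syllables);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < syllables.size(); i++) {
            out.add(syllables.get(i) + syllables.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams(List.of("一", "二", "三")));
    }
}
```

          Without a segmentation dictionary, indexing overlapping bigrams is a consistent relevance compromise: some bigrams are not real words (as Vilaythong observed for Lao), but every real two-syllable word is guaranteed to be indexed.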

          Vilaythong Southavilay added a comment -

          I tested Lao numbers. It only worked for 2-digit numbers (because of the two-syllable segmentation), but the result tokens were converted to Arabic digits (instead of Lao). This is not too bad for analyzing heading numbers and ordered lists with fewer than 100 items (the meaning and order are preserved).

          In the documents I encountered, most scientific numerals and financial figures (complex numeric strings) were written using Arabic digits.

          Nevertheless, recognizing long Lao numeric digit strings is a must-have to complete this set for Lao.

          Robert Muir added a comment - - edited

          Hi, splitting them into 2 digits is not intentional; it happened only because rbbi rule-chaining was turned off.
          So now ໐໑໒໓ stays as a single token, and later becomes 0123.

          I've written tests for, and fixed, numerics in my local copy for Lao, Myanmar, and Khmer. I will hopefully post an updated patch soon with all the improvements.

          but the result tokens were converted to Arabic numbers (instead of Lao).

          Yes, this is intentional: later there is a filter that converts all numeric digits to their Arabic forms, so a search will match either.
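          That digit conversion can be sketched in plain Java: Character.digit understands Unicode decimal digits (including Lao, U+0ED0 through U+0ED9), so folding ໐໑໒໓ to 0123 takes only a few lines. This is a sketch of the idea, not the actual Lucene filter:

```java
public class DigitFoldSketch {
    // Fold any Unicode decimal digit to its ASCII counterpart so Lao ໐໑໒໓
    // and Arabic 0123 match at search time; everything else passes through.
    static String foldDigits(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isDigit(cp)) {
                sb.append((char) ('0' + Character.digit(cp, 10)));
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(foldDigits("໐໑໒໓")); // Lao digits fold to ASCII
    }
}
```

          Folding at index time and query time means a user can type the number in either digit system and still match the same terms.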

          Robert Muir added a comment -

          Uploading a dump of my workspace so Uwe can review the new attribute.

          Uwe Schindler added a comment -

          The attribute looks good! I would only fix toString() to match the default impl by using the syntax variableName + "=" + value, here "code=" + getName(code). This makes AttributeSource.toString() look nice.

          Robert Muir added a comment -

          Thanks for the review Uwe! moving forwards...

          David Bowen added a comment -

          I have a possibly naive question about the bigram filter: why would you want to index the individual one-character tokens as well as the bigrams? CJKTokenizer just emits the bigrams. Wouldn't indexing and searching on the unigrams as well as the bigrams just slow down search?

          Robert Muir added a comment -

          I have a possibly naive question on the bigram filter

          It's not naive at all, really! I think we should do exactly what you suggest:

          change it slightly to behave just like CJKTokenizer (except, of course, working with mixed-language text and supporting UCS-4).

          Robert Muir added a comment -

          Marking this fixed, as all the ICU functionality has been broken into smaller issues and is now resolved (and simpler, thanks to ICU 4.4 changes).

          Grant Ingersoll added a comment -

          Bulk close for 3.1


            People

            • Assignee:
              Robert Muir
              Reporter:
              Robert Muir
            • Votes:
              3
              Watchers:
              2
