Lucene - Core: LUCENE-7393

Incorrect ICUTokenization on South East Asian Language

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.5
    • Fix Version/s: master (7.0), 6.2
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      Ubuntu

    • Lucene Fields:
      New

      Description

      Lucene 4.10.3 correctly tokenizes a syllable into one token. In Lucene 5.5.0, however, the same syllable ends up as two tokens, which is incorrect. Could you let me know whether the segmentation rules are implemented by native speakers of the particular language? In this example, the language is Myanmar. My understanding is that Lao, Khmer and Myanmar all fall into the ICU category. Thanks a lot.

      Example 4.10.3

      GET _analyze?tokenizer=icu_tokenizer&text="နည်"
      {
         "tokens": [
            {
               "token": "နည်",
               "start_offset": 1,
               "end_offset": 4,
               "type": "<ALPHANUM>",
               "position": 1
            }
         ]
      }
      

      Example 5.5.0

      GET _analyze?tokenizer=icu_tokenizer&text="နည်"
      {
        "tokens": [
          {
            "token": "န",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<ALPHANUM>",
            "position": 0
          },
          {
            "token": "ည်",
            "start_offset": 1,
            "end_offset": 3,
            "type": "<ALPHANUM>",
            "position": 1
          }
        ]
      }
      

        Activity

        rcmuir Robert Muir added a comment -

        Hello,

        4.10.x used a simple set of rules (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.10.4/lucene/analysis/icu/src/data/uax29/Myanmar.rbbi)

        Newer versions use ICU's dictionary-based algorithm (http://source.icu-project.org/repos/icu/icu4j/trunk/main/classes/core/src/com/ibm/icu/text/BurmeseBreakEngine.java), but that seems to be the problem here.

        You can see that by testing here: http://unicode.org/cldr/utility/breaks.jsp

        I think we should file a bug with ICU.

        rcmuir Robert Muir added a comment -

        Looking at their code, I think you can see the problem. It only has very simplistic fBeginWordSet and fEndWordSet sets, but no real handling of syllable structure (for example, no code to handle the asat sign).
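
For readers unfamiliar with the script, here is a small sketch (Python standard library only, not ICU or Lucene code) that lists the code points of the syllable from the report. It assumes the reported text is the sequence U+1014 U+100A U+103A, which is consistent with the offsets 0 to 3 in the 5.5.0 output. The last code point is the asat sign, a combining mark attached to the second consonant.

```python
import unicodedata


def describe(text):
    """Return (code point, general category, Unicode name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Assumed code points of the reported syllable, written as escapes
# so the sequence is unambiguous.
SYLLABLE = "\u1014\u100a\u103a"

for row in describe(SYLLABLE):
    print(*row)
```

The first two rows are letters (category Lo) and the third is a non-spacing mark (category Mn), so a break engine that only looks at word-begin and word-end sets has nothing that ties the consonant carrying the asat to the consonant before it.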

        rcmuir Robert Muir added a comment -

        I opened this bug at ICU: http://bugs.icu-project.org/trac/ticket/12650

        aungmaw AM added a comment -

        Thank you Robert. 4.10.x does a good job even for words borrowed from foreign languages. For example, it correctly segments ဘူးလ် as one syllable. Exceptional rules apply to loan words, and the rules in the .rbbi file seem to capture even these exceptions correctly. In 5.5.0, however, the result is two syllables, ဘူး and လ်, which is not correct. Hand-coding all of this logic in ICU's dictionary-based algorithm seems quite challenging. I think rules are more compact and handle it nicely.

        Where can I find the dictionary words used in ICU4J?

        Many thanks.

        aungmaw AM added a comment -

        Also to clarify 4.10.x uses rules from Lucene project and 5.5.0 uses algorithm from ICU4J project?

        rcmuir Robert Muir added a comment -

        OK, that is interesting to hear. I agree that fixing the hand-coded stuff looks tricky. From my perspective, the ideal solution would first use rules to find syllable breaks: this would restrict where breaks can happen at all, and then the dictionary would just refine that further.
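
To make the idea concrete, here is a minimal sketch of dictionary refinement that is constrained to syllable boundaries. It is not ICU's or Lucene's algorithm: the function name `segment`, the `max_syllables` limit, the greedy longest-match strategy and the toy dictionary are all assumptions chosen for illustration. The input is a list of syllables that some rule-based step has already produced.

```python
def segment(syllables, dictionary, max_syllables=4):
    """Greedy longest match over whole syllables (illustrative only).

    Words can only start and end at syllable boundaries, so the dictionary
    merely decides which adjacent syllables get merged into one word.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first; a single syllable always succeeds.
        for n in range(min(max_syllables, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words


# Three syllables written as escapes (hypothetical input from a rule-based step).
SYLLABLES = ["\u1000\u102f\u1014\u103a", "\u101e\u100a\u103a", "\u1019\u103b\u102c\u1038"]
# Toy dictionary holding one two-syllable word (the first two syllables joined).
DICTIONARY = {"\u1000\u102f\u1014\u103a\u101e\u100a\u103a"}

print("|".join(segment(SYLLABLES, DICTIONARY)))
```

With an empty dictionary the function returns the syllables unchanged, so the worst case is syllable-level tokens rather than a break in the middle of a syllable.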

        Here is the link for the icu4j dictionary:
        http://source.icu-project.org/repos/icu/icu/trunk/source/data/brkitr/dictionaries/burmesedict.txt

        Perhaps we should restore the old syllable rules, and make "syllable" vs "word" available as an option for Myanmar?

        I replaced these syllable rules with the ICU dictionary functionality, for two reasons:
        1. Rules were of varying quality depending on language. Lao syllable splitting came from a paper (see https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.0.0/lucene/analysis/icu/src/data/uax29/Lao.rbbi) which claims > 98% accuracy. This is quite sophisticated and even has backtracking logic. On the other hand, I think the Myanmar rules were just something I came up with (unknown quality)...
        2. Unclear if syllable is a good indexing unit for search. In my mind, syllable-as-token does make sense when the language is mostly monosyllabic. At the same time, we don't have any kind of advanced IR test suites for these languages to really know for sure.

        rcmuir Robert Muir added a comment -

        Also to clarify 4.10.x uses rules from Lucene project and 5.5.0 uses algorithm from ICU4J project?

        Yes, that is correct: for the Myanmar language, we upgraded the ICU library for 5.0 under this issue: https://issues.apache.org/jira/browse/LUCENE-5995

        aungmaw AM added a comment -

        Yes, a syllable vs. word option would be perfect. With the dictionary-based approach, some of the words might not always be correct, since the meaning of a word depends on the context. For example, 'ရန်ကုန်' means Yangon city and 'ကုန်သည်' means trader. But when the two overlap in a phrase like တက်လာရန်ကုန်သည်များက, it should be segmented as တက်|လာ|ရန်|ကုန်သည်|များ|က, not as တက်|လာ|ရန်ကုန်|သည်|များ|က. As you can see, the syllable ကုန် is the overlap. Both words could be in the dictionary, so selecting the correct one requires knowledge of the context, which is very hard with hand-crafted algorithms. Still, it is good to have until we have better language understanding.
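
The overlap can be reproduced with a small sketch that enumerates every grouping of syllables allowed by a dictionary. Everything here is illustrative: the function name `segmentations`, the `max_syllables` limit and the two-entry toy dictionary are assumptions, and the phrase is entered as escaped code points that are assumed to match the example above.

```python
def segmentations(syllables, dictionary, max_syllables=4):
    """Yield every way to group the syllables into words (illustrative only).

    A single syllable is always accepted; longer groups must be in the dictionary.
    """
    if not syllables:
        yield []
        return
    for n in range(1, min(max_syllables, len(syllables)) + 1):
        head = "".join(syllables[:n])
        if n == 1 or head in dictionary:
            for rest in segmentations(syllables[n:], dictionary, max_syllables):
                yield [head] + rest


# The seven syllables of the example phrase, as assumed code point sequences.
PHRASE = [
    "\u1010\u1000\u103a",
    "\u101c\u102c",
    "\u101b\u1014\u103a",        # third syllable
    "\u1000\u102f\u1014\u103a",  # fourth syllable: the overlap
    "\u101e\u100a\u103a",        # fifth syllable
    "\u1019\u103b\u102c\u1038",
    "\u1000",
]
# Toy dictionary: syllables 3+4 (the city name) and syllables 4+5 (trader).
DICTIONARY = {PHRASE[2] + PHRASE[3], PHRASE[3] + PHRASE[4]}

for words in segmentations(PHRASE, DICTIONARY):
    print("|".join(words))
```

Both segmentations discussed above appear in the output, along with the all-syllables one. The dictionary alone cannot rank them, which is the context problem described in the comment.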

        Would it be possible to add other words not in the ICU dictionary during analysis?

        Thanks a lot.

        rcmuir Robert Muir added a comment -

        Here is a patch restoring the previous rule-based algorithm as an option.

        Since we may keep it around and improve it in the future, I added some simple tests.

        Based on the rules and statistical analysis in the two references below, I think we should improve it further to handle more of the special cases (these cases account for less than 1%, but we should still try to do better):

        http://www.aclweb.org/anthology/I08-3010
        http://gii2.nagaokaut.ac.jp/gii/media/share/20080901-ZMM%20Presentation.pdf

        So as a followup issue, I think it would be good to simply adopt the algorithm they developed, to improve that additional 1%. The reason I do not do it here is that it may be best to do that part in ICU itself. Their algorithm does not require huge amounts of context and can be implemented with tables and sets, so it might be a good solution for the ICU issue.

        aungmaw AM added a comment -

        Agreed, it is better for ICU to handle it. To clarify, you meant the 1% is for rule-based syllable segmentation, correct? The error rate of the dictionary-based approach for word segmentation would definitely be more than 1%. I noticed that the ICU algorithm does not segment person names. As a user, I would find it ideal if the ICU algorithm could identify basic syllables plus [Person, Location and Organization] names. But the dictionary is static, new words keep appearing, and segmentation is context-sensitive, so I'm not sure how to handle it. The rule-based syllable algorithm in Lucene is close to perfect and I'm satisfied with it. Just curious, where did you get the rules?

        I didn't see the patch link though.

        Thanks a lot.

        rcmuir Robert Muir added a comment -

        To clarify, you meant the 1% is for rule-based syllable segmentation, correct?

        Yes: it is unmodified as before but I did some inspection of it. It handles all common structures but has no rules for rarer cases mentioned in that study: syllable chaining, great sa, etc.

        The rule-based syllable algorithm in Lucene is close to perfect and I'm satisfied with it. Just curious, where did you get the rules?

        As I mentioned earlier, I created these almost 7 years ago informally. This is why I was eager to remove these rules, because we know they are not perfect. They were created when Myanmar in Unicode was still rapidly changing, and I didn't find such formal algorithms at the time.

        The rules are done in a "Unicode way": they really just use the base consonant and try to let Unicode properties take care of the rest (Word_Break=Extend, etc). It is really not much more than just this main part:

        $Cons = [[:Other_Letter:]&[:Myanmar:]];
        $Virama = [\u1039];
        $Asat = [\u103A];
        
        $ConsEx = $Cons ($Extend | $Format)*;
        $AsatEx = $Cons $Asat ($Virama $ConsEx)? ($Extend | $Format)*;
        $MyanmarSyllableEx = $ConsEx ($Virama $ConsEx)? ($AsatEx)*;
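
As a rough way to experiment with these rules outside ICU, the sketch below translates them into a Python regular expression. It is an approximation under several assumptions, not the RBBI engine: [:Myanmar:] is narrowed to the main block U+1000 to U+109F, ($Extend | $Format) is replaced by the combining marks (categories Mn and Mc) in that block, and the virama is excluded from that mark class because Python's `re` takes the first successful match rather than the longest one. The names `char_class` and `syllables` are hypothetical helpers.

```python
import re
import unicodedata

# Assumption: approximate [:Myanmar:] by the main Myanmar block only.
MYANMAR_BLOCK = [chr(cp) for cp in range(0x1000, 0x10A0)]
VIRAMA = "\u1039"
ASAT = "\u103a"


def char_class(predicate):
    """Build a regex character class from the block characters matching predicate."""
    return "[" + "".join(re.escape(ch) for ch in MYANMAR_BLOCK if predicate(ch)) + "]"


# $Cons: letters (general category Lo) in the Myanmar block.
CONS = char_class(lambda ch: unicodedata.category(ch) == "Lo")
# Stand-in for ($Extend | $Format): combining marks in the block, minus the
# virama so that the explicit "$Virama $ConsEx" branch is the one that matches it.
EXT = char_class(lambda ch: unicodedata.category(ch) in ("Mn", "Mc") and ch != VIRAMA)

CONS_EX = f"{CONS}{EXT}*"
ASAT_EX = f"{CONS}{ASAT}(?:{VIRAMA}{CONS_EX})?{EXT}*"
SYLLABLE = re.compile(f"{CONS_EX}(?:{VIRAMA}{CONS_EX})?(?:{ASAT_EX})*")


def syllables(text):
    """Return the syllable-like chunks found in text; other characters are skipped."""
    return SYLLABLE.findall(text)


# The syllable from the description, as assumed code points U+1014 U+100A U+103A.
print(syllables("\u1014\u100a\u103a"))
```

Under these assumptions the reported syllable comes back as a single chunk, because the trailing consonant plus asat is absorbed by the ($AsatEx)* part of the rule. Edge cases such as a dangling virama behave differently from the real rules, so this is only useful for quick experiments.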
        

        I didn't see the patch link though.

        See the top of this issue: there is an Attachments section, underneath the Description section.

        jira-bot ASF subversion and git services added a comment -

        Commit 58f0fbd3767af649da1d47ea62f6f35b1ae28c19 in lucene-solr's branch refs/heads/master from Robert Muir
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=58f0fbd ]

        LUCENE-7393: restore old myanmar syllable tokenization as an option.

        jira-bot ASF subversion and git services added a comment -

        Commit 5d88b242057177410a90a2ea74b07d6e25b4ac84 in lucene-solr's branch refs/heads/branch_6x from Robert Muir
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5d88b24 ]

        LUCENE-7393: restore old myanmar syllable tokenization as an option.

        rcmuir Robert Muir added a comment -

        Thanks for reporting this AM.

        aungmaw AM added a comment - - edited

        Thank you Robert. Is there a way to add more keywords to the dictionary at run time?

        rcmuir Robert Muir added a comment -

        I don't think ICU exposes anything like that.

        aungmaw AM added a comment -

        Ok.

        mikemccand Michael McCandless added a comment -

        Bulk close resolved issues after 6.2.0 release.


          People

          • Assignee:
            Unassigned
            Reporter:
            aungmaw AM