Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.5, 4.0-ALPHA
    • Fix Version/s: 4.0-BETA, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Now that Unicode 6.1.0 has been released, Lucene/Solr should support it.

      JFlex trunk now supports Unicode 6.1.0.

      Tasks include:

      • Upgrade ICU4J to v49 (after it is released, scheduled for 2012-03-21 according to http://icu-project.org).
      • Use the icu module tools to regenerate the supplementary character additions to the JFlex grammars.
      • Version the JFlex grammars: copy the current implementations to *Impl3<X>; have the versioning tokenizer wrappers instantiate that version when the Version c-tor param is in the range 3.1 to the version in which these changes are released (excluding the range endpoints); then change the specified Unicode version in the non-versioned JFlex grammars from 6.0 to 6.1 (see the sketch after this list).
      • Regenerate JFlex scanners, including StandardTokenizerImpl, UAX29URLEmailTokenizerImpl, and HTMLStripCharFilter.
      • Using generateJavaUnicodeWordBreakTest.pl, generate and then run WordBreakTestUnicode_6_1_0.java under modules/analysis/common/src/test/org/apache/lucene/analysis/core/
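
      For illustration, here is a minimal sketch (not the actual Lucene source) of the kind of Version check the wrapping tokenizers could use to pick between the frozen and regenerated scanners. The frozen class name is a hypothetical stand-in for *Impl3<X>, and Version.LUCENE_40 stands in for whichever release ships these changes:

      import java.io.Reader;
      import org.apache.lucene.analysis.standard.StandardTokenizerImpl;
      import org.apache.lucene.analysis.standard.StandardTokenizerInterface;
      import org.apache.lucene.util.Version;

      class ScannerSelectionSketch {
        /** Picks the JFlex scanner matching the requested compatibility version. */
        static StandardTokenizerInterface newScanner(Version matchVersion, Reader input) {
          if (matchVersion.onOrAfter(Version.LUCENE_40)) {    // assumed release version for the 6.1 grammars
            return new StandardTokenizerImpl(input);          // regenerated Unicode 6.1 grammar
          } else {
            return new StandardTokenizerImpl3X(input);        // frozen Unicode 6.0 copy (hypothetical class name)
          }
        }
      }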
      Attachments

      1. LUCENE-3747.patch
        1.02 MB
        Steve Rowe
      2. LUCENE-3747.patch
        678 kB
        Steve Rowe
      3. LUCENE-3747-remainders.patch
        22 kB
        Steve Rowe

        Activity

        Robert Muir added a comment -

        +1. As soon as the ICU release comes out, we should start working on the update.

        Additional things for updating ICU:

        • Check for any UAX#29 differences (I think we are unaffected)
        • Update files in icu/src/data/utr30 (I really need to make a script to automate this, but it does document what has to happen)
        • Try again to remove the Java 7 workaround hack in LuceneTestCase for http://bugs.icu-project.org/trac/ticket/8734
        Steve Rowe added a comment -

        Check for any UAX#29 differences (I think we are unaffected)

        Right, I'm not sure about this - I plan on upgrading the JFlex test cases that implement the UAX#29 rules and test against the data Unicode.org provides. I should know more once that's done.

        Robert Muir added a comment -

        The "changes.txt" is here: http://www.unicode.org/versions/Unicode6.1.0/ along with the log here: http://www.unicode.org/reports/tr29/tr29-19.html#Modifications

        Steve Rowe added a comment -

        Check for any UAX#29 differences (I think we are unaffected)

        Right, I'm not sure about this - I plan on upgrading the JFlex test cases that implement the UAX#29 rules and test against the data Unicode.org provides. I should know more once that's done.

        I've finished adding Unicode 6.1 versions to JFlex's UAX#29 test cases, including the word break rules test case, and the only change I noticed that could conceivably affect Lucene's UAX#29 tokenizers is the new Section 8, which discusses Korean syllables. Since the rules listed in that section are not part of the word break rules, but rather are a tailoring, and since that section says "All standard Korean syllable blocks used in modern Korean are of the form <L V T> or <L V> and have equivalent, single-character precomposed forms.", I don't think we need to support this (right now anyway).

        (By contrast, the UAX#14 line break rules changed significantly between Unicode v6.0 and v6.1, and I'm still working to add a Unicode 6.1 version to JFlex's corresponding test case.)

        DM Smith added a comment -

        A release candidate is available.

        Steve Rowe added a comment -

        First stab at trunk patch.

        Steve Rowe added a comment -

        Subversion script for trunk:

        svn rm lucene/test-framework/src/java/org/apache/lucene/util/TestRuleIcuHack.java
        svn rm lucene/analysis/icu/lib/icu4j-4.8.1.1.jar.sha1
        svn rm solr/contrib/extraction/lib/icu4j-4.8.1.1.jar.sha1
        svn rm solr/contrib/analysis-extras/lib/icu4j-4.8.1.1.jar.sha1
        

        branch_4x will need more svn moves and version checks for the versioned grammars.

        lucene/analysis/common/

        • I ran ant gen-tlds
        • I ran ant jflex

        lucene/analysis/icu/

        uax29

        • I don't fully understand the syntax used in the .rbbi files, so I didn't check whether they need algorithm updates. (However, since I didn't need to make any changes for the JFlex version, probably no algorithm changes needed.)
        • I ran ant genrbbi.

        utr30

        BasicFoldings.txt:

        • For those things that are "additions to kd" - how to extend?
        • For dashes folding, I added some non-included ranges. Q: should wave dash be folded to swung dash? (They look the same.)
        • I don't know how to extend underline folding - is there a property?
        • I don't know how to extend punctuation folding - is there a property?

        DiacriticFolding.txt:

        • In the [:Diacritic:] section, I'm not sure how to proceed, as I can see several missing Latin-1 code points that were almost certainly part of Unicode 6.0.0, so the selection mechanism is non-transparent.
        • In the [:Mark:]&[:Lm:] section, I'm not sure how to make selections, so I didn't try.
        • In the "Additional Arabic/Hebrew decompositions" section, I don't know how to extend.
        • Other sections were based either on AsciiFoldingFilter or UTR#30, neither of which has changed

        DingbatFolding.txt:

        • based on AsciiFoldingFilter, which hasn't changed

        HanRadicalFolding.txt:

        • based on UTR#30, which hasn't changed

        NativeDigitFolding.txt:

        • I wrote a shell/perl script, embedded in the text file, to update.
        • Should [:No:] DIGIT chars be included? One currently is: 19DA;NEW TAI LUE THAM DIGIT ONE;No;0;L;;;1;1;N;;;;;, but others are not (other ranges listed in the patch).

        nfkc.txt:

        • New version copied directly from icu-project.org
        • There is a problem: the following from TestICUFoldingFilter fails:
        46:  assertAnalyzesTo(a, "Μάϊος", new String[] { "μαιοσ" });
        

        AFAICT, this is because the accent decomposition mappings are no longer present in nfkc.txt. This makes no sense; Robert, do you know what's happening here?

        nfkc_cf.txt:

        • New version copied directly from icu-project.org
        Steve Rowe added a comment -

        Try again to remove the Java 7 workaround hack in LuceneTestCase for http://bugs.icu-project.org/trac/ticket/8734

        I ran your little program with ICU4J 49.1, and the Error is no longer raised. I've removed the workaround class.

        Robert Muir added a comment -

        Thanks so much for tackling this! Give me some time to check out what you did...

        Robert Muir added a comment -

        I'll comment on the other things, but just to answer this:

        AFAICT, this is because the accent decomposition mappings are no longer present in nfkc.txt. This makes no sense; Robert, do you know what's happening here?

        You need to bring in nfc.txt too now.

        http://bugs.icu-project.org/trac/ticket/9023

        Steve Rowe added a comment -

        You need to bring in nfc.txt too now. http://bugs.icu-project.org/trac/ticket/9023

        Whew, cool!

        Robert Muir added a comment -

        For those things that are "additions to kd" - how to extend?

        What this means is that it's the mappings in UTR#30 minus what NFKC does, so I don't think we really need to worry about this much. If we wanted, we could check the sets present for some of them, defined by the link below:
        http://www.unicode.org/reports/tr30/tr30-4.html#_Toc42
        That's sort of the rule for this whole file.

        Example:

        ## Space Folding
        # [:Zs:] > U+0020
        1680>0020
        180E>0020
        

        So that is basically iterating this UnicodeSet (can be done in code with ICU) and generating mappings to 0020:

        [[:Zs:]-[\u0020]-[:Changes_When_NFKC_Casefolded=Yes:]]
        

        Does that make sense?
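
        As a concrete sketch (assuming ICU4J's UnicodeSet and UnicodeSetIterator; the class name is made up), generating those mapping lines could look like this:

        import com.ibm.icu.text.UnicodeSet;
        import com.ibm.icu.text.UnicodeSetIterator;

        public class SpaceFoldingSketch {
          public static void main(String[] args) {
            // [:Zs:] minus U+0020 itself, minus anything NFKC_Casefold already handles
            UnicodeSet set = new UnicodeSet(
                "[[:Zs:]-[\\u0020]-[:Changes_When_NFKC_Casefolded=Yes:]]");
            UnicodeSetIterator it = new UnicodeSetIterator(set);
            while (it.next()) {
              if (it.codepoint == UnicodeSetIterator.IS_STRING) continue; // single code points only
              System.out.printf("%04X>0020%n", it.codepoint);             // e.g. "1680>0020"
            }
          }
        }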

        Should [:No:] DIGIT chars be included? One currently is: 19DA;NEW TAI LUE THAM DIGIT ONE;No;0;L;;;1;1;N;;;;;, but others are not (other ranges listed in the patch).

        I don't see any problems including [:Numeric_Type=Digit:], but I wouldn't use [:No:].
        So something like [[:Numeric_Type=Digit:][:Nd:]]?
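
        A similarly hedged sketch for the native-digit mappings, using that combined set plus UCharacter.getNumericValue to pick the ASCII target digit (the class name is made up; this is not the script embedded in NativeDigitFolding.txt):

        import com.ibm.icu.lang.UCharacter;
        import com.ibm.icu.text.UnicodeSet;
        import com.ibm.icu.text.UnicodeSetIterator;

        public class NativeDigitFoldingSketch {
          public static void main(String[] args) {
            UnicodeSet digits = new UnicodeSet("[[:Numeric_Type=Digit:][:Nd:]]");
            UnicodeSetIterator it = new UnicodeSetIterator(digits);
            while (it.next()) {
              if (it.codepoint == UnicodeSetIterator.IS_STRING) continue;
              int value = UCharacter.getNumericValue(it.codepoint); // 0..9 for these sets
              if (value < 0 || value > 9) continue;                 // skip anything unexpected
              if (it.codepoint == '0' + value) continue;            // already an ASCII digit
              System.out.printf("%04X>%04X%n", it.codepoint, '0' + value); // e.g. "19DA>0031"
            }
          }
        }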

        In the [:Diacritic:] section, I'm not sure how to proceed, as I can see several missing Latin-1 code points that were almost certainly part of Unicode 6.0.0, so the selection mechanism is non-transparent.

        It's definitely a subset. Some stuff in here, like viramas, should not be folded away.
        Sorry, I don't have a well-defined set or criteria; it was just common sense.
        For the last update, I basically only reviewed the 'new' ones and made a decision, e.g.:

        [[:Diacritic:]-[:Age=6.0:]]
        

        So this will be the trickiest part of the file to automate, I think, as it was originally
        defined as a list for the most part: http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt

        Robert Muir added a comment -

        Basically, Steve, my opinion is that if we have a good way to script this thing, we should just try to come
        up with some appropriate sets for this stuff and automate it. It doesn't need to be perfect.

        And then go forward from there with fine-tuning the script... but I think automation should be
        the priority!

        Steve Rowe added a comment -
        • I ran perl generateJavaUnicodeWordBreakTest.pl and deleted the previously-generated WordBreakTestUnicode_6_0_0.java in favor of the new WordBreakTestUnicode_6_1_0.java. The new full svn script is:
          svn rm lucene/test-framework/src/java/org/apache/lucene/util/TestRuleIcuHack.java
          svn rm lucene/analysis/icu/lib/icu4j-4.8.1.1.jar.sha1
          svn rm solr/contrib/extraction/lib/icu4j-4.8.1.1.jar.sha1
          svn rm solr/contrib/analysis-extras/lib/icu4j-4.8.1.1.jar.sha1
          svn rm lucene/analysis/common/src/test/org/apache/lucene/analysis/core/WordBreakTestUnicode_6_0_0.java
          
        • Updated to automate the following via a new ant target gen-utr30-data-files, which gennorm2 now depends on:
          • Download nfc.txt, nfkc.txt and nfkc_cf.txt from Unicode.org
          • Convert round-trip mappings in nfc.txt to one-way mappings if the right-hand side contains [:Diacritic:] (a sketch of this step follows below the list)
          • Expand UnicodeSet rules in the other norm2 files.
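
        A rough sketch of the round-trip-to-one-way conversion step, assuming the gennorm2 convention that '=' marks a round-trip mapping and '>' a one-way mapping, with only minimal comment handling (the class name is made up; the committed ant target may do this differently):

        import com.ibm.icu.text.UnicodeSet;
        import java.io.IOException;
        import java.nio.charset.StandardCharsets;
        import java.nio.file.Files;
        import java.nio.file.Paths;

        public class RoundTripToOneWaySketch {
          public static void main(String[] args) throws IOException {
            UnicodeSet diacritics = new UnicodeSet("[:Diacritic:]");
            for (String line : Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
              int eq = line.indexOf('=');
              if (!line.startsWith("#") && !line.startsWith("*") && eq >= 0
                  && rhsHasDiacritic(line.substring(eq + 1), diacritics)) {
                // demote the round-trip mapping ("=") to a one-way mapping (">")
                line = line.substring(0, eq) + '>' + line.substring(eq + 1);
              }
              System.out.println(line);
            }
          }

          // The right-hand side is a space-separated list of hex code points, e.g. "0041 0300"
          private static boolean rhsHasDiacritic(String rhs, UnicodeSet diacritics) {
            int hash = rhs.indexOf('#');
            if (hash >= 0) rhs = rhs.substring(0, hash); // strip trailing comment
            for (String hex : rhs.trim().split("\\s+")) {
              if (!hex.isEmpty() && diacritics.contains(Integer.parseInt(hex, 16))) {
                return true;
              }
            }
            return false;
          }
        }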

        Where I couldn't figure out a rule, I put in an annotation ("# Rule: verbatim") to leave the following mappings as-is.

        Robert, I couldn't discern any logic to the exceptions you made to the "[:Diacritic:]>" mappings, so I left it at the full [:Diacritic:] set; feel free to amend the rule.

        After these changes, I ran ant gennorm2.

        All tests pass. I think this is ready to go.

        (More work to be done on branch_4x, where the current Unicode 6.0 JFlex-based implementations need to be accessible via LUCENE_36.)

        Robert Muir added a comment -

        If it's automated then I'm +1. We can refine it in other issues (keeping with the automated approach).

        I took a glance at the patch and it looks nice and very thorough... thank you!!!!!

        Steve Rowe added a comment - edited

        Committed to trunk: r1365971.

        Backporting to branch_4x now.

        Steve Rowe added a comment -

        There was a source generation problem: "Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8" got embedded in two of the intermediate generated .jflex-macro files. If the JAVA_TOOL_OPTIONS environment variable is set, the JVM picks it up as if it were command-line options and then prints that notice, apparently onto the same stream that gets captured by one of the source generation processes (see the illustrative sketch below).

        I committed a fix to trunk: r1366231.
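
        (The committed fix may well differ; purely to illustrate the failure mode, here is the kind of filtering a source-generation step could apply to captured JVM output. The helper name is made up.)

        import java.util.ArrayList;
        import java.util.List;

        public class JvmNoticeFilterSketch {
          // Drop the JVM's "Picked up JAVA_TOOL_OPTIONS: ..." / "Picked up _JAVA_OPTIONS: ..." notices,
          // which are printed whenever those env vars are set and would otherwise leak into generated files.
          static List<String> stripJvmNotices(List<String> capturedOutput) {
            List<String> clean = new ArrayList<>();
            for (String line : capturedOutput) {
              if (!line.startsWith("Picked up JAVA_TOOL_OPTIONS:")
                  && !line.startsWith("Picked up _JAVA_OPTIONS:")) {
                clean.add(line);
              }
            }
            return clean;
          }
        }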

        Steve Rowe added a comment -

        Committed to branch_4x: r1366298.

        Steve Rowe added a comment -

        I missed a couple of Unicode 6.0 mentions. Patch in a moment.

        Steve Rowe added a comment -

        HTMLStripCharFilter.jflex needed to be upgraded (%unicode 6.0 -> %unicode 6.1) and regenerated; the rest is just documentation, though this patch does include all regenerated .java files.

        Committing shortly.

        Steve Rowe added a comment -

        Committed:

        • trunk: r1387813
        • branch_4x: r1387837

        Commit Tag Bot added a comment -

        [branch_4x commit] Steven Rowe
        http://svn.apache.org/viewvc?view=revision&revision=1387837

        LUCENE-3747: finish upgrading to Unicode 6.1 (merge trunk r1387813)


          People

          • Assignee:
            Steve Rowe
          • Reporter:
            Steve Rowe
          • Votes:
            0
          • Watchers:
            2
