Lucene - Core
LUCENE-6993

Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all JFlex-based tokenizers to support Unicode 8.0

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 6.x
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      We did this once before in LUCENE-5357, but it might be time to update the list of TLDs again. Comparing our old list with a new list indicates 800+ new domains, so it would be nice to include them.

      Also the JFlex tokenizer grammars should be upgraded to support Unicode 8.0.

      1. LUCENE-6993.patch
        1.76 MB
        Mike Drob
      2. LUCENE-6993.patch
        2.18 MB
        Mike Drob
      3. LUCENE-6993.patch
        2.20 MB
        Mike Drob
      4. LUCENE-6993.patch
        2.20 MB
        Mike Drob
      5. LUCENE-6993.patch
        3.00 MB
        Mike Drob
      6. LUCENE-6993.patch
        5.50 MB
        Mike Drob
      7. LUCENE-6993.patch
        170 kB
        Mike Drob
      8. LUCENE-6993.patch
        170 kB
        Mike Drob

        Activity

        mdrob Mike Drob added a comment -

        Attaching a patch against trunk that updates the TLD Macro file and the UAX29URLEmailTokenizerImpl.

        When running ant jflex I had to increase the amount of heap space available due to the increased number of TLDs; I'm not sure whether this will negatively impact the rest of the build.

        The .an and .tp domains were removed from the list, and random.text.with.urls was updated accordingly.

        mdrob Mike Drob added a comment -

        Steve Rowe - you did the previous incarnation of this fix, do you have time to look at this one?

        steve_rowe Steve Rowe added a comment -

        Hi Mike Drob, sure, I'll try to look at it some time this week.

        mdrob Mike Drob added a comment -

        Hi Steve - do you have any updates or would you like me to ping somebody else? Thanks!

        mdrob Mike Drob added a comment -

        Robert Muir - Do you have any thoughts on this since you were involved in the previous patch too?

        rcmuir Robert Muir added a comment -

        I think that with a major release looming we should update all this stuff, including the Unicode version (and the ICU library) to Unicode 8.0, because Java has already done this for JDK 9 (http://openjdk.java.net/jeps/267) and we should not fall so far behind.

        We should copy the current generated grammar into a 'std55' subdirectory and hook it in for backwards compatibility before applying grammar changes. Then I think just fix all this stuff at once? It sounds worse than it is; I think it can be done today, and I will help.

        mdrob Mike Drob added a comment -

        That all makes sense. I was looking at the unicode spec changes between 6.3 and 8.0 and did not really understand what the impact to our grammars is.

        I'll add the current grammar to a std55 directory, but will need some help making sure that I've got all the right back-compat hooks. I'll post an updated patch shortly when I get stuck.

        rcmuir Robert Muir added a comment -

        OK, I can look into the ICU part in a separate issue, since it's somewhat unrelated, but I think it's worthwhile for consistency.

        mdrob Mike Drob added a comment -

        From analysis/common directory, I ran ANT_OPTS="-Xmx6g" ant gen-tlds unicode-data jflex

        Do I need to manually update the %unicode X.X directive in the .jflex files or do we want to leave that for compatibility? I am unclear on the impacts here.

        Also, I did not see any previous examples of keeping the tokenizers around for compatibility, so I guess I didn't quite understand where the hooks are.

        rcmuir Robert Muir added a comment -

        Basically the old versions of the Tokenizer and Impl are just "saved" to a subdirectory, and in the Analyzer and TokenizerFactory we conditionally use them, if you request that compatibility version.

        Have a look at branch_5x which still has std40 containing StandardTokenizer40, StandardTokenizerImpl40, UAX29URLEmailTokenizer40, and so on. TestStandardAnalyzer and TestUAX29URLEmailAnalyzer also have a testBackcompat40 which calls setVersion and ensures it works. Finally, see StandardAnalyzer/TokenizerFactory.java, and UAXURLEmailAnalyzer/TokenizerFactory.java which conditionally use StandardTokenizer40 depending on version.

        So we should do a similar thing with the current stuff in master before modifying the files, and make them std55. We can just test that it works at all (e.g. foo bar -> foo,bar) initially and later maybe add a test ensuring "old behavior" stays the same.

        Then you can bump unicode version and tld lists and it won't change any behavior if someone asks for version < 6.0, because they will get the exact same tokenizer as before.
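        The dispatch described above can be sketched in a self-contained way. The stub Version enum and string return values below are simplified stand-ins (the real code uses Lucene's Version class and actual Tokenizer instances, and the std50 name reflects where the thread later settles), so treat this as an illustration of the pattern, not the actual Lucene code:

```java
// Self-contained sketch of the version-dispatch pattern described in the
// comment above. The enum and string names are simplified stand-ins for
// Lucene's Version class and Tokenizer implementations.
public class TokenizerDispatch {

    public enum Version {
        LUCENE_5_4(5, 4), LUCENE_6_0_0(6, 0);

        final int major, minor;
        Version(int major, int minor) { this.major = major; this.minor = minor; }

        boolean onOrAfter(Version other) {
            return major > other.major || (major == other.major && minor >= other.minor);
        }
    }

    // Pick the generated scanner by compatibility version: new grammar on or
    // after 6.0, otherwise the frozen copy kept for back compat.
    public static String pickTokenizer(Version matchVersion) {
        if (matchVersion.onOrAfter(Version.LUCENE_6_0_0)) {
            return "StandardTokenizer";    // regenerated, newer Unicode grammar
        } else {
            return "StandardTokenizer50";  // frozen copy in the compat package
        }
    }

    public static void main(String[] args) {
        System.out.println(pickTokenizer(Version.LUCENE_5_4));
        System.out.println(pickTokenizer(Version.LUCENE_6_0_0));
    }
}
```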

        rcmuir Robert Muir added a comment -

        And I guess really we should call it std50 to keep things simple. If someone asks for 5.4 compatibility, they should get this one, and then the logic in the Analyzer will make it clear that is the case, even going forward.

        rcmuir Robert Muir added a comment -

        I took care of the icu parts here: LUCENE-7035

        Please ping me here if you have trouble setting up the back compat. I can always do that part if it gets too frustrating, but it is better if more people can do it.

        mdrob Mike Drob added a comment -

        Attaching a new version of the patch that differentiates between the new and old tokenizers. A couple of notes...

        I used LUCENE_5_6 as the cutoff, since this could conceivably be backported to the 5.x branch. I know there aren't any plans for a 5.6 release ever, but this seems like a low-cost change.

        The UAX29URLEmailTokenizerImpl50.java currently shows up with private methods, which breaks compilation. I can't figure out how to make those public, even though the old and new jflex files look almost identical. Would appreciate a second set of eyes here.

        rcmuir Robert Muir added a comment -

        I can't even imagine us releasing a 5.6 after a 6.0; I really do not think we should drag that idea into this issue. It's a bad one.

        Let's target 6.0 here for all this stuff: these are major changes that impact backwards compatibility. The logic should be:

        if (version.onOrAfter(LUCENE_6_0_0)) {
          // new tokenizer
        } else {
          // old tokenizer
        }
        
        mdrob Mike Drob added a comment -

        Ok, switched to 6_0_0 in the next patch.

        JFlex 1.6.1 currently only supports Unicode 7.0, not 8.0 - Steve Rowe, do you know what the jflex timeline for upgrading looks like?

        I changed jflex-legacy to be a public target and ran it individually, and it generated public accessors like we need, so I'm still not sure what's going on, but we can move past it at this point. The jflex target still screws it up, though, so I'm going to take jflex-legacy out of it.

        steve_rowe Steve Rowe added a comment -

        JFlex 1.6.1 currently only supports Unicode 7.0, not 8.0 - Steve Rowe, do you know what the jflex timeline for upgrading looks like?

        Unicode 8.0 support is committed on JFlex master, but no release includes it yet. (So if you want to test I think you could build JFlex locally, change the JFlex dependency in Lucene to use the snapshot, then run the Lucene build.) No timeline for release has been set. I'll ping JFlex founder Gerwin Klein, who has done all the releases, and get back to you here.

        steve_rowe Steve Rowe added a comment - edited

        Mike Drob, I haven't looked at your patch yet but there is a non-rote Unicode upgrade item that needs to be dealt with - from LUCENE-5357's TODO list:

        • Upgrade the UAX#29-based grammars to the Unicode 8.0 word break rules, in StandardTokenizerImpl.jflex and UAX29URLEmailTokenizer.jflex.

        UAX#29 word break rules can (and usually do) change with each Unicode release, so we'll need to review the changes between 6.3 and 8.0 and see what, if anything, needs changing in the tokenizer grammars. Another item from the LUCENE-5357 TODO list will confirm that this has been done correctly:

        • Test the new scanners against the Unicode 8.0 word break test data
          • [...]
        mdrob Mike Drob added a comment -

        Looking at http://unicode.org/reports/tr29/#Modifications I see

        Revision 27 [KW, LI]
        
            Reissued for Unicode 8.0.
            Modified rule SB7 to prevent sentence breaks within a word segment such as “Mr.Hamster”.
            Updated notes on tailoring using CLDR boundary suppressions.
            Recast rule tables to use macros for compactness.
            Updated table styles, removed inconsistently applied styles on character names and code points, and adjusted layout of various tables and figures.
            Section 3.1 Default Grapheme Cluster Boundary Specification
                Removed the New Tai Lue characters U+19B0..U+19B4, U+19B8..U+19B9, U+19BB..U+19C0, U+19C8..U+19C9 from the exception list for SpacingMark in Table 2, Grapheme_Cluster_Break Property Values.
                Added U+11720 AHOM VOWEL SIGN A and U+11721 AHOM VOWEL SIGN AA to the same exception list for SpacingMark.
        
        Revision 26 being a proposed update, only changes between versions 27 and 25 are noted here.
        Revision 25
        
            Reissued for Unicode 7.0.
            General text cleanup, including “_” in property and property value names, use of curly-quotes and italics.
            Section 3.1 Default Grapheme Cluster Boundary Specification
                Added U+AA7D MYANMAR SIGN TAI LAING TONE-5 to the exception list for SpacingMark in Table 2, Grapheme_Cluster_Break Property Values.
            Section 5.1 Default Sentence Boundary Specification
                Added note to clarify that Format and Extend characters are not joined to separators like LF.
                Added note about the fact that words can span a sentence break.
        

        I am by no means an expert in Unicode, but it looks like the Sentence Break rules are not relevant to us, right? But the SpacingMark / Grapheme Cluster changes are relevant?
        When you refer to the word break test data, is that something that the Unicode Consortium publishes or do you mean our internal data?
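        What word break test data exercises can be illustrated with the JDK's java.text.BreakIterator, which performs default UAX#29-style word segmentation. Note this is only a sketch: the Unicode version BreakIterator implements depends on the JDK in use, so it is not a substitute for the generated WordBreakTestUnicode test classes.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: default UAX#29-style word segmentation via the JDK's
// BreakIterator. The Unicode version it implements depends on the JDK, so
// this only sketches what "word break test data" exercises.
public class WordBreakDemo {

    public static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // Keep only tokens containing a letter or digit, dropping the
            // spaces and punctuation between word boundaries.
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("foo bar"));
    }
}
```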

        steve_rowe Steve Rowe added a comment -

        I think you're right, Mike, I don't see any default word break rule modifications in that list between versions 6.3 and 8.0.

        The test data is here: http://www.unicode.org/Public/8.0.0/ucd/auxiliary/WordBreakTest.txt

        Here's the full TODO item from LUCENE-5357 (after s/6.3/8.0/g and s/6.1/6.3/g):

        • Test the new scanners against the Unicode 8.0 word break test data
          • Update generateJavaUnicodeWordBreakTest.pl to handle above-BMP characters in the Unicode character database's ucd/auxiliary/WordBreakTest.txt (previous Unicode versions included only BMP characters in that file). (one time operation, already done.)
          • Using generateJavaUnicodeWordBreakTest.pl, generate WordBreakTestUnicode_8_0_0.java under modules/analysis/common/src/test/org/apache/lucene/analysis/core/.
          • Update TestStandardAnalyzer.java and TestUAX29URLEmailTokenizer.java to invoke WordBreakTestUnicode_8_0_0 rather than WordBreakTestUnicode_6_3_0.
          • Remove WordBreakTestUnicode_6_3_0.java.
        mdrob Mike Drob added a comment -

        Ok, making progress...

        To recap what I've done so far (so that it's easier next time)

        * ant gen-tlds
        copy affected analyzers to a stdX directory
        update affected factories
        update jflex-legacy task in build.xml
        update test data to use valid domains
        * ANT_OPTS='-Xmx5g' ant jflex
        * perl generateJavaUnicodeWordBreakTest.pl -v 8.0.0
        * sed 's/6_3/8_0/g' TestStandardAnalyzer.java TestUAX29URLEmailTokenizer.java
        

        (starred entries are verbatim commands)

        I also had to add a special case to the Perl script to add a Java escape when there is a double quote inside the output string text. I saw this in one of the other Unicode release notes relating to Hebrew text. The specific test this is necessary for is א"א
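        The reason the generator needs that special case: the test text is emitted into a Java source file, so any raw double quote inside it must be escaped in the generated string literal. The helper name below is a hypothetical stand-in for what the Perl script emits, shown only to illustrate the escaping:

```java
// Why the generator needs to escape the quote: emitting test text into a
// Java source file means any raw '"' in the text must become '\"' in the
// generated string literal. toJavaLiteral is a hypothetical stand-in for
// the emit logic in generateJavaUnicodeWordBreakTest.pl.
public class QuoteEscapeDemo {

    public static String toJavaLiteral(String text) {
        return "\"" + text.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        String hebrew = "\u05D0\"\u05D0";          // א"א -- a quote between two alefs
        System.out.println(toJavaLiteral(hebrew));
    }
}
```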

        I was not able to figure out how to install a JFlex snapshot locally – I thought running mvn install on master would work, but it turns out I don't understand the nuances of Ivy. Consequently, I have not updated the Unicode directives to 8.0 for any of the parsers, but all the analyzer/tokenizer tests still seem to pass, so we might be ok?

        rcmuir Robert Muir added a comment -

        I think we need to regenerate still, because there are new characters/character property changes so the actual tokenizer will change (even if the rules stay the same: the alphabet got bigger).

        mdrob Mike Drob added a comment -

        Question about what is proper behaviour in terms of backwards compatibility here...

        Upgrading JFlex from 1.6.0 to 1.6.1 (and 1.7.0, I assume) changes the generated output. I have no idea whether the behaviour is identical between the new class files and the old. I imagine that we want to keep the Impls generated by the old version when operating with an old Lucene match version, rather than regenerating those with the new JFlex. If so, I'll drop the work I did on updating the jflex-legacy task, since it doesn't make sense to keep around (it wouldn't generate code matching what is in source control).

        Does this make sense?

        mdrob Mike Drob added a comment -

        I think we need to regenerate still, because there are new characters/character property changes so the actual tokenizer will change (even if the rules stay the same: the alphabet got bigger).

        Ok. My current plan will be to copy all existing tokenizers to std50 packages, update the factories to be cognizant of lucene version, update current jflex files to all use unicode 8.0 and then regenerate all of the new tokenizer classes.

        Some of the tokenizers have a unicode 3.0 directive, which indicates that they haven't been touched in a long time. This worries me a bit, but I'll see how it goes.

        rcmuir Robert Muir added a comment -

        I think we should be ok. As far as I understand it, JFlex will respect that Unicode directive and the grammar, and generate the equivalent state machine. But regenerating the "old grammars" means we get bugfixes from JFlex: e.g. performance or memory improvements or whatever improved there, so I think it's the right thing to do.

        steve_rowe Steve Rowe added a comment -

        +1

        mdrob Mike Drob added a comment -

        Using a newer version of JFlex breaks our existing macros...

          <macrodef name="run-jflex-and-disable-buffer-expansion">
            <attribute name="dir"/>
            <attribute name="name"/>
            <sequential>
              <jflex file="@{dir}/@{name}.jflex" outdir="@{dir}" nobak="on" inputstreamctor="false"/>
              <!-- LUCENE-5897: Disallow scanner buffer expansion -->
        <!-- ... snip ... -->
              <replaceregexp file="@{dir}/@{name}.java"
                             match="(zzFinalHighSurrogate = 1;)(\r?\n)"
                             replace="\1\2          if (totalRead == 1) { return true; }\2"/>
            </sequential>
          </macrodef>
        

        There is no longer a totalRead variable tracked by the JFlex-generated code; instead we could check numRead, I think. However, from reading LUCENE-5897 it is unclear whether this behaviour would have been fixed in later JFlex releases, in which case we wouldn't need the "-and-disable-buffer-expansion" macro at all.
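        If the numRead guess pans out, the adjustment would presumably be confined to the injected guard in the macrodef quoted above. The fragment below is a hedged sketch of that change, untested against actual JFlex 1.7 output; the numRead variable name is an assumption taken from this thread:

```xml
<!-- Hedged sketch: if the newer generated code tracks numRead instead of
     totalRead, the injected buffer-expansion guard might become the
     following. Untested against actual JFlex 1.7 output. -->
<replaceregexp file="@{dir}/@{name}.java"
               match="(zzFinalHighSurrogate = 1;)(\r?\n)"
               replace="\1\2          if (numRead == 1) { return true; }\2"/>
```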

        steve_rowe Steve Rowe added a comment -

        Yeah, the generated code underwent some changes there, so the hack we use to disable buffer expansion will require some adjustment. This technique should be in JFlex itself, though; I'll take a look this weekend at getting it in before the 1.7 release.

        mdrob Mike Drob added a comment -

        Steve, did you get a chance to look at the buffer adjustment? I'm going to pre-emptively remove the hack from our code and trust that it will be in the next release of JFlex.

        Unrelated; the ClassicTokenizer is already advertised as a legacy option – I don't think it makes sense to create a new version of it here. WDYT?

        steve_rowe Steve Rowe added a comment -

        Steve, did you get a chance to look at the buffer adjustment? I'm going to pre-emptively remove the hack from our code and trust that it will be in the next release of JFlex.

        Not yet. Planning on it though.

        Unrelated; the ClassicTokenizer is already advertised as a legacy option – I don't think it makes sense to create a new version of it here. WDYT?

        Uwe Schindler has written that he still recommends this tokenizer in some cases, so if you're asking if we should remove it, I don't think so.

        thetaphi Uwe Schindler added a comment -

        Uwe Schindler has written that he still recommends this tokenizer in some cases, so if you're asking if we should remove it, I don't think so.

        I think the question was if it should also be upgraded to a newer Unicode version. But since it does not rely on any Unicode version, the generated Java files should be identical. Please don't remove it!

        mdrob Mike Drob added a comment -

        Yeah, Uwe understood my question. I wasn't planning on removing it, but I also didn't quite understand whether it needed to be updated to the new Unicode version. Sounds like I can leave it as is.

        steve_rowe Steve Rowe added a comment - - edited

        ClassicTokenizer does have direct Unicode version dependencies: [:digit:] and [:alpha:] are the equivalent of \p{Digit} and \p{Letter}, respectively. Right now those definitions are pinned at Unicode 3.0, which means that characters added since Unicode 3.0 (released 15 years ago, in 2000) will not be properly tokenized.

        Also, there are several effectively-pinned character sets (for CJK and Thai) that are hard-coded in the grammar, and don't include any supplementary characters at all. If the Unicode version changes, these will need to be moved to use the appropriate Unicode properties instead.

        I guess I'm -0 on leaving the Unicode version as-is because of the above, but since this tokenizer will never be removed, it seems bad to me to keep it pinned to such an old Unicode version.

        rcmuir Robert Muir added a comment -

        Yeah, it's tricky. I kinda view ClassicTokenizer as a tokenizer for the ignorant... it's got tons of bogus Western-only assumptions and is basically wrong in every possible way. But arguing with this is like arguing with Donald Trump, so better to give folks like this their own dedicated crappy tokenizer and keep them off our back. From this perspective, it can be wired to Unicode 1.0 and it serves its intended purpose.

        mdrob Mike Drob added a comment -

        I think I am getting to a good place here, just a few more issues that I need some additional direction –

          /**
           * Sets the scanner buffer size in chars
           */
          public final void setBufferSize(int numChars) {
            ZZ_BUFFERSIZE = numChars;
            char[] newZzBuffer = new char[ZZ_BUFFERSIZE];
            System.arraycopy(zzBuffer, 0, newZzBuffer, 0, Math.min(zzBuffer.length, ZZ_BUFFERSIZE));
            zzBuffer = newZzBuffer;
          }
        

        This is code that we inject directly from our jflex templates, not generated code. True to their promises, ZZ-prefixed items in JFlex are subject to change, and this one (ZZ_BUFFERSIZE) has become final between the old and new versions. We could fix this with an additional post-processing step to strip the final modifier, or contribute a change to JFlex to add a new constructor or something like that. It looks like all of the non-test usage of setting the size happens immediately after construction.
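
        If we went the post-processing route, the regex step could look something like this (a hypothetical sketch, not the actual build change; it assumes the generated declaration reads private static final int ZZ_BUFFERSIZE):

```java
import java.util.regex.Pattern;

public class StripFinalModifier {
  // Drop the final modifier from the ZZ_BUFFERSIZE declaration only, leaving
  // every other final field in the generated scanner untouched.
  static String stripFinal(String source) {
    return Pattern.compile("(private static )final( int ZZ_BUFFERSIZE)")
        .matcher(source)
        .replaceAll("$1$2");
  }

  public static void main(String[] args) {
    String line = "  private static final int ZZ_BUFFERSIZE = 16384;";
    System.out.println(stripFinal(line));
  }
}
```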

        Also, there are several effectively-pinned character sets (for CJK and Thai) that are hard-coded in the grammar, and don't include any supplementary characters at all. If the Unicode version changes, these will need to be moved to use the appropriate Unicode properties instead.

        Currently ClassicTokenizer has THAI = [\u0E00-\u0E59]; ALPHANUM = ({LETTER}|{THAI}|[:digit:])+;. If I understand the Unicode spec correctly, with Unicode 8.0 we could remove the THAI declaration and the range would be correctly included in LETTER. But I have near zero confidence in this. Alternatively, leaving it as-is should be fine, because the assigned Thai characters have not gone outside that range.
        For CJK, we have a special call-out for CJ, but K was apparently already included in LETTER? I don't understand the relationship between ALPHANUM, \p{Letter}, and CJK.

        rcmuir Robert Muir added a comment -

        I wouldn't change any of the ClassicTokenizer ranges; it should just continue to do what it did before.

        Not all of the Thai characters are letters, and it's important not to e.g. split on tone marks or make other mistakes like that: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3AThai%3A]-[%3ALetter%3A]]%0D%0A&g=&i=

        CJ is a separate category because ClassicTokenizer will return each Han character individually as a token. On the other hand, Hangul (K) is kept with LETTER because it is an alphabet.
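
        Robert's distinction is easy to check against the JDK's own Unicode tables (an illustration using java.lang.Character, not the JFlex grammar itself):

```java
public class ThaiRangeCheck {
  public static void main(String[] args) {
    char koKai    = '\u0E01'; // THAI CHARACTER KO KAI -- a consonant, category Lo
    char maiEk    = '\u0E48'; // THAI CHARACTER MAI EK -- a tone mark, category Mn
    char thaiZero = '\u0E50'; // THAI DIGIT ZERO -- category Nd

    // All three sit inside ClassicTokenizer's THAI range [\u0E00-\u0E59],
    // but only the consonant is a letter -- so replacing {THAI} with {LETTER}
    // would split tokens at tone marks and Thai digits.
    System.out.println(Character.isLetter(koKai));    // true
    System.out.println(Character.isLetter(maiEk));    // false
    System.out.println(Character.isLetter(thaiZero)); // false
    System.out.println(Character.isDigit(thaiZero));  // true
  }
}
```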

        mdrob Mike Drob added a comment -

        New patch, takes care of all 5 generated tokenizers.

        This patch is built using jflex 1.6.1 and unicode 7, so that we can at least have something in time for 6.0.

        I looked at the new generated jflex code and I think it takes care of the buffer expansion issue. At the very least, our existing StandardAnalyzer tests pass. Still need to have a macro for fixing buffersize, though.

        Had issues with TestUAX29URLEmailTokenizer.testLongEMAILatomText taking a while, not sure if that's part of the same issue or not.

        Also, moved the jflex version to the properties file with everything else, instead of setting it directly in build.xml.

        rcmuir Robert Muir added a comment -

        Had issues with TestUAX29URLEmailTokenizer.testLongEMAILatomText taking a while, not sure if that's part of the same issue or not.

        Well, this test is already marked @Slow and just took 41.2s on my machine. Were you seeing stuff like that? As far as I know from the original issue, there were tests for this bug that would basically never finish at all.

        steve_rowe Steve Rowe added a comment - - edited

        Mike, can you please exclude generated files from your patch? The patches here are way big, and reviewers/committers will want to regenerate anyway.

        mdrob Mike Drob added a comment -

        Well this test is already marked @Slow and just took 41.2s on my machine. Were you seeing stuff like that? As far as i know from the original issue, there were tests for this bug that would basically never finish at all .

        I left it to run and came back later and saw that it took 50 minutes. But it passed. 40 seconds on your machine sounds great, I won't worry about it, thanks.

        Mike, can you please exclude generated files from your patch? The patches here are way big, and reviewers/committers will want to regenerate anyway.

        Sure, this makes sense.

        Steps to generate everything:

        #!/usr/bin/env bash
        
        pushd lucene/analysis/common
        ANT_OPTS="-Xmx5g" ant gen-tlds jflex
        ant jflex-legacy # For some reason this needs to be run separately from the jflex command. I could never figure out why.
        pushd src/test/org/apache/lucene/analysis/standard
        rm WordBreakTestUnicode_6_3_0.java
        perl generateJavaUnicodeWordBreakTest.pl -v 8.0.0
        popd
        popd
        
        rcmuir Robert Muir added a comment -

        If that test really took 50 minutes, there may be some issue there...

        mdrob Mike Drob added a comment -

        Apologies for the patch churn here...

        I took another look at the buffer expansion pieces and ran through a bit of it with a debugger. Ended up making some slight tweaks to our post-generation regex replaces, tests pass in a reasonable amount of time now. I would really appreciate a second set of eyes there, though.

        rcmuir Robert Muir added a comment -

        I don't understand this change:

        -      <fileset dir="src/java/org/apache/lucene/analysis/standard" includes="**/*.java">
        +      <fileset dir="src/java/org/apache/lucene/analysis/standard" includes="*.java">
        

        This makes me worried it will not be applied to e.g. std50 subpackages, and re-introduce the performance bug?

        rcmuir Robert Muir added a comment -

        sorry, that just affects the cleaning part. But still, it should stay, because otherwise clean-jflex does not really work on the subdirectories.

        mdrob Mike Drob added a comment -

        There's a clean-jflex-legacy target that takes care of the subdirectories.

        I had to split the jflex and jflex-legacy tasks because when run together all of the legacy code is generated with private methods instead of public ones and won't compile. I probably spent a full day looking at that trying to figure out what's going on, and eventually gave up.

        rcmuir Robert Muir added a comment -

        OK, thanks for the work! I will try to review all the changes in detail and confirm things are ok. I do think it is best to target 6.1 here.

        mdrob Mike Drob added a comment -

        Robert Muir - did you get a chance to look at this? Should I wait to ping you again until after 6.0 has been released?

        rcmuir Robert Muir added a comment -

        My mistake, thanks for the reminder. I have been working on getting things off the old numeric encoding, but need to break up that work with other things too.

        mdrob Mike Drob added a comment -

        Any updates here? I'm not sure if there is anything I need to be doing to keep this patch up to date.

        mdrob Mike Drob added a comment -

        Steve Rowe - I pinged the jflex list about getting the release going, but it looks like there are still a few outstanding issues to be resolved on that end. Do you think it is still worth waiting on the release there, or should we move forward here until jflex catches up and re-engage then?

        steve_rowe Steve Rowe added a comment -

        Hi Mike, sorry I haven't had the bandwidth to engage on this issue and on JFlex recently. Thanks for the work you've done here and in creating JFlex issues for some of the release blocking issues there.

        Since Lucene 6.0 will be released shortly, and there is usually a gap of at least a month between minor releases, I think it makes sense to delay the decision about waiting on JFlex release for a couple weeks. I hope to be able to work on JFlex release blockers this weekend and next week. If after a couple weeks no JFlex release is imminent, I'd say move forward here.

        mdrob Mike Drob added a comment -

        Steve Rowe - I see no movement coming from the JFlex community. How would you feel most comfortable proceeding?

        steve_rowe Steve Rowe added a comment -

        Thanks for persisting Mike.

        I (and other JFlex community members) still haven't made any progress on the JFlex release blockers, so it's probably best to move forward using the current JFlex release rather than waiting for JFlex 1.7 to be released.

        I'll try to review your work on this issue this week.

        steve_rowe Steve Rowe added a comment - - edited

        Hi Mike, my review of your latest patch:

        • All the on-or-after tests in analysis factories should switch to 6.1.0 (since that's where this will likely land)
        • I agree with Robert that ClassicTokenizer should stay on Unicode 3.0 - you changed it to 7.0. That means it doesn't need version-specific behavior, so the factory changes and the build.xml targets aren't required.
        • WikipediaTokenizer is in the same boat as ClassicTokenizer - it's essentially a cloned ClassicTokenizer with some modifications for Wiki syntax. I vote for keeping it at Unicode 3.0 and reverting the factory changes and the build.xml targets. Performing an effective upgrade would probably mean cloning the current StandardTokenizer and then re-layering the wiki syntax.
        • In generateJavaUnicodeWordBreakTest.pl, you added the test for the double quote code point: push @tokens, '"'.join('', map { $_ ~~ /0022/ ? "\\\"" : "\\u$_" } @chars).'"'; - why use the smartmatch operator here? No recursion required or unknown types here. Just /0022/ instead of $_ ~~ /0022/ would work, right?
        • TestStandardAnalyzer and TestUAX29URLEmailTokenizer refer to WordBreakTestUnicode_8_0_0, but that should be _7_0_0.
        • In the run-jflex-and-disable-buffer-expansion target in build.xml, you removed the comment with the link to the relevant JIRA - why?
        • You said:

          I looked at the new generated jflex code and I think it takes care of the buffer expansion issue.

          +1 - LGTM

        • Robert said:

          TestStandardAnalyzer and TestUAX29URLEmailAnalyzer also have a testBackcompat40 which calls setVersion and ensures it works.

          but AFAICT you didn't put any backcompat tests in place?

        • you said:

          ant jflex-legacy # For some reason this needs to be run separately from the jflex command. I could never figure out why.

          What happens if you make it a dependency of the jflex target?


          People

          • Assignee:
            rcmuir Robert Muir
          • Reporter:
            mdrob Mike Drob
          • Votes:
            0
          • Watchers:
            4
