Lucene - Core
  1. Lucene - Core
  2. LUCENE-2074

Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The current trunk version of StandardTokenizerImpl was generated by Java 1.4 (according to the warning). In Java 3.0 we switch to Java 1.5, so we should regenerate the file.

      After regeneration the Tokenizer behaves different for some characters. Because of that we should only use the new TokenizerImpl when Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

      1. LUCENE-2074-lucene30.patch
        3 kB
        Uwe Schindler
      2. LUCENE-2074.patch
        40 kB
        Uwe Schindler
      3. LUCENE-2074.patch
        40 kB
        Uwe Schindler
      4. LUCENE-2074.patch
        41 kB
        Uwe Schindler
      5. LUCENE-2074.patch
        40 kB
        Uwe Schindler
      6. LUCENE-2074.patch
        55 kB
        Uwe Schindler
      7. LUCENE-2074.patch
        65 kB
        Uwe Schindler
      8. LUCENE-2074.patch
        63 kB
        Uwe Schindler
      9. LUCENE-2074.patch
        65 kB
        Uwe Schindler
      10. LUCENE-2074.patch
        67 kB
        Uwe Schindler
      11. LUCENE-2074.patch
        69 kB
        Uwe Schindler
      12. jflexwarning.patch
        0.7 kB
        Mark Miller
      13. jflex-1.4.1-vs-1.5-snapshot.diff
        5 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Here the patch. It uses an interface containing the needed methods to easyliy switch between both impl. The old one was deprecated and stays there (little modified) without any jflex (which is no longer needed). If we want to change the syntax, we must generate new JFlex file and reenerate a new parser from it (using separate class name)

          Show
          Uwe Schindler added a comment - Here the patch. It uses an interface containing the needed methods to easyliy switch between both impl. The old one was deprecated and stays there (little modified) without any jflex (which is no longer needed). If we want to change the syntax, we must generate new JFlex file and reenerate a new parser from it (using separate class name)
          Hide
          Simon Willnauer added a comment -

          This might be the wrong place to mention it but I feel bad about this whole Version Enum. It became a pest spread all over the code. Lucene Code begins to look like the C++ Boost library where you see more precompiler statements than template code. We should really try hard to find different solutions than spreading Version all over the place.
          I know this is a hard problem but I want to make sure that we do not spread is out in every corner of the code. The Version thing is already annoying enough in Contrib/analyzer.

          Show
          Simon Willnauer added a comment - This might be the wrong place to mention it but I feel bad about this whole Version Enum. It became a pest spread all over the code. Lucene Code begins to look like the C++ Boost library where you see more precompiler statements than template code. We should really try hard to find different solutions than spreading Version all over the place. I know this is a hard problem but I want to make sure that we do not spread is out in every corner of the code. The Version thing is already annoying enough in Contrib/analyzer.
          Hide
          Uwe Schindler added a comment -

          For this one it's not new, it was there before my patch

          Show
          Uwe Schindler added a comment - For this one it's not new, it was there before my patch
          Hide
          Mark Miller added a comment -

          I still think we also still need a more prominent warning system.

          Show
          Mark Miller added a comment - I still think we also still need a more prominent warning system.
          Hide
          Robert Muir added a comment -

          I am anti-Version too in a lot of ways. I worry it will spread everywhere and make things a mess, and maybe we can come up with more creative solutions to get rid of it in the future.

          But I think Uwe's patch is ok, Version was actually created with StandardTokenizer in mind I believe...

          Show
          Robert Muir added a comment - I am anti-Version too in a lot of ways. I worry it will spread everywhere and make things a mess, and maybe we can come up with more creative solutions to get rid of it in the future. But I think Uwe's patch is ok, Version was actually created with StandardTokenizer in mind I believe...
          Hide
          Simon Willnauer added a comment -

          For this one it's not new, it was there before my patch

          This is just the latest issue I found related to that stuff. The version member was there already but the conditional was introduced. Don't get me wrong I just wanna make sure we are not using it as a general purpose conditional! This is going to be a nightmare otherwise. I would only use it if there is NO other way at all.

          simon

          Show
          Simon Willnauer added a comment - For this one it's not new, it was there before my patch This is just the latest issue I found related to that stuff. The version member was there already but the conditional was introduced. Don't get me wrong I just wanna make sure we are not using it as a general purpose conditional! This is going to be a nightmare otherwise. I would only use it if there is NO other way at all. simon
          Hide
          Mark Miller added a comment -

          We should really try hard to find different solutions than spreading Version all over the place.
          I know this is a hard problem but I want to make sure that we do not spread is out in every corner of the code. The Version thing is already annoying enough in Contrib/analyzer.

          The problem is, these are the hard backwards compat situations that it was created for - the whole analyzer package was/is bound to have lots of Version stuff.

          Show
          Mark Miller added a comment - We should really try hard to find different solutions than spreading Version all over the place. I know this is a hard problem but I want to make sure that we do not spread is out in every corner of the code. The Version thing is already annoying enough in Contrib/analyzer. The problem is, these are the hard backwards compat situations that it was created for - the whole analyzer package was/is bound to have lots of Version stuff.
          Hide
          Uwe Schindler added a comment -

          Updated patch with comment fixed and dead Token-related code removed.

          Show
          Uwe Schindler added a comment - Updated patch with comment fixed and dead Token-related code removed.
          Hide
          Simon Willnauer added a comment -

          nothing against the patch! I just used this issue as a channel and I agree there would have been a better choice though.

          simon

          Show
          Simon Willnauer added a comment - nothing against the patch! I just used this issue as a channel and I agree there would have been a better choice though. simon
          Hide
          Uwe Schindler added a comment -

          I add the warning to my patch! Thanks. What do you think about the patch, should we go that way?

          Show
          Uwe Schindler added a comment - I add the warning to my patch! Thanks. What do you think about the patch, should we go that way?
          Hide
          Robert Muir added a comment -

          Uwe, also, just checking, i don't know javacc at all, does it use unicode properties? We have a lot of queryparsers out there...

          Show
          Robert Muir added a comment - Uwe, also, just checking, i don't know javacc at all, does it use unicode properties? We have a lot of queryparsers out there...
          Hide
          Simon Willnauer added a comment -

          The problem is, these are the hard backwards compat situations that it was created for - the whole analyzer package was/is bound to have lots of Version stuff.

          Afaik in the contrib/analyzis package this is only used because of STDAnalyzer and StopFilter. It seems like a kind of an overkill. But again I should on do any "issue - high-jacking" here. The thing is even if you come up with a better solution this will most likely stay forever - just like all the IFDEF stuff in Boost

          Show
          Simon Willnauer added a comment - The problem is, these are the hard backwards compat situations that it was created for - the whole analyzer package was/is bound to have lots of Version stuff. Afaik in the contrib/analyzis package this is only used because of STDAnalyzer and StopFilter. It seems like a kind of an overkill. But again I should on do any "issue - high-jacking" here. The thing is even if you come up with a better solution this will most likely stay forever - just like all the IFDEF stuff in Boost
          Hide
          Uwe Schindler added a comment - - edited

          Uwe, also, just checking, i don't know javacc at all, does it use unicode properties? We have a lot of queryparsers out there...

          I do not know it, too

          The only query parser using jflex is the new one. And the new one should normally use no unicode properties. Can you check the JFlex file? All other query parsers use JavaCC

          edit

          No Jflex is used by any query parser. But WikipediaTokenizer uses JFlex...

          Show
          Uwe Schindler added a comment - - edited Uwe, also, just checking, i don't know javacc at all, does it use unicode properties? We have a lot of queryparsers out there... I do not know it, too The only query parser using jflex is the new one. And the new one should normally use no unicode properties. Can you check the JFlex file? All other query parsers use JavaCC edit No Jflex is used by any query parser. But WikipediaTokenizer uses JFlex...
          Hide
          Michael McCandless added a comment -

          I feel bad about this whole Version Enum

          I think this is simply a sign of 1) Lucene's maturity, and 2) that we
          take back compat seriously. I actually think we don't yet use it
          enough...

          EG, LUCENE-1255 was one nasty bug, that we at first fixed, but then
          rolled back, because of the back-compat break. Then it was
          rediscovered and opened again, as LUCENE-1542, when we decided it was
          nasty enough to just fix it and put an entry in CHANGES that you
          hopefully will read.

          But it really is a back-compat break, in that apps could quite easily
          be relying on the buggy behavior. I think that bug would have been a
          good reason to add Version to IW.

          Fixing invalid acronyms in StandardAnalyzer, but then leaving it
          broken by default, was the original "inspiration" for Version. We
          shouldn't every fix a bug, but then be forced to leave the bug in
          place due to back compat.

          Version lets us fix bugs, change defaults for the better, etc., w/o
          compromising on our back compat policy. It's an impoprtant
          tool...

          The problem is, these are the hard backwards compat situations that it was created for - the whole analyzer package was/is bound to have lots of Version stuff.

          Right, I think Version will especially find its way into changes that
          alter what's indexed (analyzers, bugs like LUCENE-1255, etc.).

          Show
          Michael McCandless added a comment - I feel bad about this whole Version Enum I think this is simply a sign of 1) Lucene's maturity, and 2) that we take back compat seriously. I actually think we don't yet use it enough... EG, LUCENE-1255 was one nasty bug, that we at first fixed, but then rolled back, because of the back-compat break. Then it was rediscovered and opened again, as LUCENE-1542 , when we decided it was nasty enough to just fix it and put an entry in CHANGES that you hopefully will read. But it really is a back-compat break, in that apps could quite easily be relying on the buggy behavior. I think that bug would have been a good reason to add Version to IW. Fixing invalid acronyms in StandardAnalyzer, but then leaving it broken by default, was the original "inspiration" for Version. We shouldn't every fix a bug, but then be forced to leave the bug in place due to back compat. Version lets us fix bugs, change defaults for the better, etc., w/o compromising on our back compat policy. It's an impoprtant tool... The problem is, these are the hard backwards compat situations that it was created for - the whole analyzer package was/is bound to have lots of Version stuff. Right, I think Version will especially find its way into changes that alter what's indexed (analyzers, bugs like LUCENE-1255 , etc.).
          Hide
          Robert Muir added a comment -

          well, the wikipediatokenizer at least is similar to standardtokenizer, except it does not use unicodeproperties, instead hardcoded ranges.

          so the behavior won't change from 2.9.x, but it wont be unicode 4 either, don't know if we should worry about this?

          Show
          Robert Muir added a comment - well, the wikipediatokenizer at least is similar to standardtokenizer, except it does not use unicodeproperties, instead hardcoded ranges. so the behavior won't change from 2.9.x, but it wont be unicode 4 either, don't know if we should worry about this?
          Hide
          Uwe Schindler added a comment - - edited

          It uses hardcode char ranges, the parser is therefore not JDK-dependent. Let's keep it as it is for now. Mediawiki itsself is not unicode conform, because it's written in PHP and PHP only gets unicode in 6.0 (lol) (that says a PHP core committer named U.S. from Bremen in Germany...)

          Show
          Uwe Schindler added a comment - - edited It uses hardcode char ranges, the parser is therefore not JDK-dependent. Let's keep it as it is for now. Mediawiki itsself is not unicode conform, because it's written in PHP and PHP only gets unicode in 6.0 ( lol ) (that says a PHP core committer named U.S. from Bremen in Germany...)
          Hide
          Uwe Schindler added a comment -

          Should we fix this for 3.0 or not?
          The current JFlex file in trunk/lucene_30 is generated by Java 1.4 (I verified), so it does not break. So we could wait for 3.1 and provide there a new StandardTokenizer with unicode 5 support

          Show
          Uwe Schindler added a comment - Should we fix this for 3.0 or not? The current JFlex file in trunk/lucene_30 is generated by Java 1.4 (I verified), so it does not break. So we could wait for 3.1 and provide there a new StandardTokenizer with unicode 5 support
          Hide
          Robert Muir added a comment -

          Uwe, we could fix in 3.1 (but we should commit the warning no matter what I think!)

          If we commit for 3.0, then its still not really correct for Unicode 4.
          in my opinion, better would be to wait for 3.1 and use this interface you built, along with a new version of Jflex for much better unicode support?

          Show
          Robert Muir added a comment - Uwe, we could fix in 3.1 (but we should commit the warning no matter what I think!) If we commit for 3.0, then its still not really correct for Unicode 4. in my opinion, better would be to wait for 3.1 and use this interface you built, along with a new version of Jflex for much better unicode support?
          Hide
          Uwe Schindler added a comment -

          This is the patch for version 3.0, that keeps the old jflex file but adds an extra warning.

          Show
          Uwe Schindler added a comment - This is the patch for version 3.0, that keeps the old jflex file but adds an extra warning.
          Hide
          Uwe Schindler added a comment -

          Patch for trunk using Version.LUCENE_31

          Show
          Uwe Schindler added a comment - Patch for trunk using Version.LUCENE_31
          Hide
          Uwe Schindler added a comment -

          Do we agree, that this patch should wait for 3.1, as the JFlex parser is already ok and backwards compatible in 3.0, so no need to do anything? In 3.1 together with the other unicode changes, we will update StandardTokenizer with Version.LUCENE_31?

          Show
          Uwe Schindler added a comment - Do we agree, that this patch should wait for 3.1, as the JFlex parser is already ok and backwards compatible in 3.0, so no need to do anything? In 3.1 together with the other unicode changes, we will update StandardTokenizer with Version.LUCENE_31?
          Hide
          Mark Miller added a comment -

          +1 here

          Show
          Mark Miller added a comment - +1 here
          Hide
          Uwe Schindler added a comment -

          Committed 3.0 part in revision: 881317
          I also committed this to trunk, to have the warning also in trunk.

          Show
          Uwe Schindler added a comment - Committed 3.0 part in revision: 881317 I also committed this to trunk, to have the warning also in trunk.
          Hide
          Uwe Schindler added a comment -

          Updated patch for current trunk (merged). It also minimalized the interface declaration for simple upgrades to JFlex 1.5

          Show
          Uwe Schindler added a comment - Updated patch for current trunk (merged). It also minimalized the interface declaration for simple upgrades to JFlex 1.5
          Hide
          Uwe Schindler added a comment -

          I think we can go forward with this now?

          I would like to also add a check in the build.xml, that tests before running jflex, that the Java version is as exspected.

          Show
          Uwe Schindler added a comment - I think we can go forward with this now? I would like to also add a check in the build.xml, that tests before running jflex, that the Java version is as exspected.
          Hide
          Robert Muir added a comment -

          I think we can go forward with this now?

          Should we wait for Jflex 1.5? I do not know how close it is. Steven Rowe will know.
          But I think the unicode branch he was working has been merged to their trunk so we could check to make sure your interface will work for it, at the least.

          Show
          Robert Muir added a comment - I think we can go forward with this now? Should we wait for Jflex 1.5? I do not know how close it is. Steven Rowe will know. But I think the unicode branch he was working has been merged to their trunk so we could check to make sure your interface will work for it, at the least.
          Hide
          Uwe Schindler added a comment - - edited

          I tried with JFlex 1.5 svn trunk version from today, built with maven. The produced java file is almost identical, the big array is 100% identical. There are only some new private methods and one switch was reordered. The interface works for both versions, so nothing to change.

          I am not sure, which unicode version this JFlex now uses and if Steven's changes are in it. I cannot test with Java 1.4, as this version requires 1.5.

          I attached a patch with the difference between the JFlex-1.4.1 on Java 5 generated java file and the snapshot version (also Java 5). You can see that the big DFA array was not changed at all.

          Show
          Uwe Schindler added a comment - - edited I tried with JFlex 1.5 svn trunk version from today, built with maven. The produced java file is almost identical, the big array is 100% identical. There are only some new private methods and one switch was reordered. The interface works for both versions, so nothing to change. I am not sure, which unicode version this JFlex now uses and if Steven's changes are in it. I cannot test with Java 1.4, as this version requires 1.5. I attached a patch with the difference between the JFlex-1.4.1 on Java 5 generated java file and the snapshot version (also Java 5). You can see that the big DFA array was not changed at all.
          Hide
          Robert Muir added a comment -

          Uwe, I think you need to change your grammar to take advantage of the new features

          Show
          Robert Muir added a comment - Uwe, I think you need to change your grammar to take advantage of the new features
          Hide
          Uwe Schindler added a comment -

          That's obvious. But do we want to do this? I thought the only interesting "new feature" was that it uses an internal unicode impl, that is not JVM specific´, so the JVM of the jflex compilation is not relevant. This issue only waits for this, doesn't it?

          So I think we only have to change the grammar to specify a specific unicode version in the header?

          Show
          Uwe Schindler added a comment - That's obvious. But do we want to do this? I thought the only interesting "new feature" was that it uses an internal unicode impl, that is not JVM specific´, so the JVM of the jflex compilation is not relevant. This issue only waits for this, doesn't it? So I think we only have to change the grammar to specify a specific unicode version in the header?
          Hide
          Robert Muir added a comment -

          Uwe, thats not the only interesting new feature. The other is it supports the properties necessary to actually tokenize text according to unicode standards, instead of words just being runs of IsLetter=True.

          You can see such a grammar here: http://jflex.svn.sourceforge.net/viewvc/jflex/trunk/testsuite/testcases/src/test/cases/unicode-word-break/UnicodeWordBreakRules_5_2.flex?revision=578&view=markup

          you should try %unicode 5.2 for now, but we shouldn't wimp out with that for 3.1.

          Show
          Robert Muir added a comment - Uwe, thats not the only interesting new feature. The other is it supports the properties necessary to actually tokenize text according to unicode standards, instead of words just being runs of IsLetter=True. You can see such a grammar here: http://jflex.svn.sourceforge.net/viewvc/jflex/trunk/testsuite/testcases/src/test/cases/unicode-word-break/UnicodeWordBreakRules_5_2.flex?revision=578&view=markup you should try %unicode 5.2 for now, but we shouldn't wimp out with that for 3.1.
          Hide
          Uwe Schindler added a comment -

          So I should also be able to regenerate the old grammar with %Unicode 3.1 (that was the version of JDK 1.4?) - that would be fine, because I could simply provide another grammar file to regenerate all parsers!

          That would be helpful as a first step for 3.1!

          Show
          Uwe Schindler added a comment - So I should also be able to regenerate the old grammar with %Unicode 3.1 (that was the version of JDK 1.4?) - that would be fine, because I could simply provide another grammar file to regenerate all parsers! That would be helpful as a first step for 3.1!
          Hide
          Robert Muir added a comment -

          ooh i actually do liek that... it should work... give it a try

          Show
          Robert Muir added a comment - ooh i actually do liek that... it should work... give it a try
          Hide
          Uwe Schindler added a comment -

          Accoring to the title of this issue, we should regenerate two parsers with %unicode 3.1 for the legacy !onOrAfter(LUCENE_31) parser and one for Java 5, which is %unicode 4.0. Then we can close the issue and work with later improvements as you suggested (which may be language dependent, what this tokenizer should not be).

          Show
          Uwe Schindler added a comment - Accoring to the title of this issue, we should regenerate two parsers with %unicode 3.1 for the legacy !onOrAfter(LUCENE_31) parser and one for Java 5, which is %unicode 4.0. Then we can close the issue and work with later improvements as you suggested (which may be language dependent, what this tokenizer should not be).
          Hide
          Robert Muir added a comment -

          Uwe I agree, I am just answering your questions about "only interesting new feature"...

          i don't think we shoudl do anything here (except experiment) until we have an official jflex release either...

          Show
          Robert Muir added a comment - Uwe I agree, I am just answering your questions about "only interesting new feature"... i don't think we shoudl do anything here (except experiment) until we have an official jflex release either...
          Hide
          Uwe Schindler added a comment -

          Attached is a patch.

          To regenerate the parsers you can run "ant jflex", but the sysprop jflex.home has to point to the JFlex trunk checkout, where mvn install has run before (I changed build.xml and common-build.xml to work correctly).

          I added a test that tests the tokenization in Java 1.4 (Version.LUCENE_30) and Java 1.5 mode (Version.LUCENE_CURRENT). There are two JFlex files, one that is Unicode 3.0 (Java 1.4.1) compatible (and even when run in JDK 5, it produces now an Java 1.4 compatible parser!) and one with unicode version 4.0 (Java 5).

          Show
          Uwe Schindler added a comment - Attached is a patch. To regenerate the parsers you can run "ant jflex", but the sysprop jflex.home has to point to the JFlex trunk checkout, where mvn install has run before (I changed build.xml and common-build.xml to work correctly). I added a test that tests the tokenization in Java 1.4 (Version.LUCENE_30) and Java 1.5 mode (Version.LUCENE_CURRENT). There are two JFlex files, one that is Unicode 3.0 (Java 1.4.1) compatible (and even when run in JDK 5, it produces now an Java 1.4 compatible parser!) and one with unicode version 4.0 (Java 5).
          Hide
          Robert Muir added a comment -

          Hi Uwe, do we even need READ_BEFORE_REGENERATING.txt after this patch?

          Will the old jflex fail on %unicode

          {x.y}

          syntax ???

          Show
          Robert Muir added a comment - Hi Uwe, do we even need READ_BEFORE_REGENERATING.txt after this patch? Will the old jflex fail on %unicode {x.y} syntax ???
          Hide
          Uwe Schindler added a comment - - edited

          I will try out if it fails. But you cannot even regenerate, because the package name of Flex changed in 1.5 (because of that I had to change build.xml).

          One thing: I will rename the generated classes:

          • StandardTokenizerImplOrig (the initial one without Version)
          • StandardTokenizerImpl31 for the new one
          • ... newer ones come after that

          By this whenever we create a new version/syntax, we can name it exactly like the first version it is supported. Will upload (not commit g) a patch later.

          Show
          Uwe Schindler added a comment - - edited I will try out if it fails. But you cannot even regenerate, because the package name of Flex changed in 1.5 (because of that I had to change build.xml). One thing: I will rename the generated classes: StandardTokenizerImplOrig (the initial one without Version) StandardTokenizerImpl31 for the new one ... newer ones come after that By this whenever we create a new version/syntax, we can name it exactly like the first version it is supported. Will upload (not commit g ) a patch later.
          Hide
          Robert Muir added a comment -

          By this whenever we create a new version/syntax, we can name it exactly like the first version it is supported. Will commit a patch later.

          commit a patch? or upload a patch? I still think we should wait for official Jflex release, unless a jflex developer comments on this issue otherwise

          Show
          Robert Muir added a comment - By this whenever we create a new version/syntax, we can name it exactly like the first version it is supported. Will commit a patch later. commit a patch? or upload a patch? I still think we should wait for official Jflex release, unless a jflex developer comments on this issue otherwise
          Hide
          Uwe Schindler added a comment -

          This patch now implements my latest proposal about the filenames. To easy see, what changed in the TokenizerImpls, the patch cannot be applied before doing some copy/rename before.

          Do the following:

          • svn copy StandardTokenizerImpl.* to StandardTokenizerImplOrig.*
          • svn move StandardTokenizerImpl.* to StandardTokenizerImpl31.*

          After that you have two copies of the original Tokenizer Impls. After that apply the patch. The patch clearly shows, that even after regeneration with Java 1.5, the original version using Java 1.4 (Unicode 3) is equal to before (esp. the DFA matrix). The 31-version is different (other matrix).

          If we later create new versions, we can call them 32 etc.

          This patch solves the JFlex 1.4 problem with needing the explicit java version. It currently requires the trunk version of JFlex, which would be no problem for this parsers (as verified, that they produce the same DFA & code for 1.4). So other speak up, Steven Rowe? What do you think. Only developers need the trunk version at the moment as the generated files are in the checkout.

          Hopefully JFlex 1.5 comes out until we release 3.1, I would be happy. In later issues we can optimize the newly added 31 version with more unicode features, the Orig version stays as it is. We could also remove the special cases in the latest version like replaceInvlaidAcronym and so on, as this only applies for Version.LUCENE_2x.

          Show
          Uwe Schindler added a comment - This patch now implements my latest proposal about the filenames. To easy see, what changed in the TokenizerImpls, the patch cannot be applied before doing some copy/rename before. Do the following: svn copy StandardTokenizerImpl.* to StandardTokenizerImplOrig.* svn move StandardTokenizerImpl.* to StandardTokenizerImpl31.* After that you have two copies of the original Tokenizer Impls. After that apply the patch. The patch clearly shows, that even after regeneration with Java 1.5, the original version using Java 1.4 (Unicode 3) is equal to before (esp. the DFA matrix). The 31-version is different (other matrix). If we later create new versions, we can call them 32 etc. This patch solves the JFlex 1.4 problem with needing the explicit java version. It currently requires the trunk version of JFlex, which would be no problem for this parsers (as verified, that they produce the same DFA & code for 1.4). So other speak up, Steven Rowe? What do you think. Only developers need the trunk version at the moment as the generated files are in the checkout. Hopefully JFlex 1.5 comes out until we release 3.1, I would be happy. In later issues we can optimize the newly added 31 version with more unicode features, the Orig version stays as it is. We could also remove the special cases in the latest version like replaceInvlaidAcronym and so on, as this only applies for Version.LUCENE_2x.
          Hide
          Steve Rowe added a comment -

          Will the old jflex fail on %unicode {x.y} syntax ???

          I haven't tested it, but JFlex <1.5 likely will fail on this syntax, since nothing is expected after the %unicode directive.

          Hopefully JFlex 1.5 comes out until we release 3.1, I would be happy.

          I think the JFlex 1.5 release will happen before March of next year, since we're down to just a few blocking issues.

          Show
          Steve Rowe added a comment - Will the old jflex fail on %unicode {x.y} syntax ??? I haven't tested it, but JFlex <1.5 likely will fail on this syntax, since nothing is expected after the %unicode directive. Hopefully JFlex 1.5 comes out until we release 3.1, I would be happy. I think the JFlex 1.5 release will happen before March of next year, since we're down to just a few blocking issues.
          Hide
          Uwe Schindler added a comment -

          Thanks Steve.

          Do you see a problem with just requiring Flex 1.5 for Lucene trunk at the moment? It would hep us to not require a Java 1.4 JRE just to convert the jflex files.

          The new parsers (see patch) are pre-generated in SVN, so somebody compiling lucene from source does need to use jflex. And the parsers for StandardTokenizer are verified to work correct and are even identical (DFA wise) for the old Java 1.4 / Unicode 3.0 case.

          Show
          Uwe Schindler added a comment - Thanks Steve. Do you see a problem with just requiring Flex 1.5 for Lucene trunk at the moment? It would hep us to not require a Java 1.4 JRE just to convert the jflex files. The new parsers (see patch) are pre-generated in SVN, so somebody compiling lucene from source does need to use jflex. And the parsers for StandardTokenizer are verified to work correct and are even identical (DFA wise) for the old Java 1.4 / Unicode 3.0 case.
          Hide
          Steve Rowe added a comment -

          Do you see a problem with just requiring Flex 1.5 for Lucene trunk at the moment?

          I think it's fine to do that.

          The new parsers (see patch) are pre-generated in SVN, so somebody compiling lucene from source does need to use jflex. And the parsers for StandardTokenizer are verified to work correct and are even identical (DFA wise) for the old Java 1.4 / Unicode 3.0 case.

          Most of the StandardTokenizerImpl.jflex grammar is expressed in absolute terms - the only JVM-/Unicode-version-sensistive usages are [:letter:] and [:digit:], which under JFlex <1.5 were expanded using the scanner-generation-time JVM's Character.isLetter() and .isDigit() definitions, but under JFlex 1.5-SNAPSHOT depend on the declared Unicode version definitions (i.e., [:letter:] = \p

          {Letter}

          ).

          I'm actually surprised that the DFAs are identical, since I'm almost certain that the set of characters matching [:letter:] changed between Unicode 3.0 and Unicode 4.0 (maybe [:digit:] too). I'll take a look this weekend.

          Show
          Steve Rowe added a comment - Do you see a problem with just requiring Flex 1.5 for Lucene trunk at the moment? I think it's fine to do that. The new parsers (see patch) are pre-generated in SVN, so somebody compiling lucene from source does need to use jflex. And the parsers for StandardTokenizer are verified to work correct and are even identical (DFA wise) for the old Java 1.4 / Unicode 3.0 case. Most of the StandardTokenizerImpl.jflex grammar is expressed in absolute terms - the only JVM-/Unicode-version-sensistive usages are [:letter:] and [:digit:] , which under JFlex <1.5 were expanded using the scanner-generation-time JVM's Character.isLetter() and .isDigit() definitions, but under JFlex 1.5-SNAPSHOT depend on the declared Unicode version definitions (i.e., [:letter:] = \p {Letter} ). I'm actually surprised that the DFAs are identical, since I'm almost certain that the set of characters matching [:letter:] changed between Unicode 3.0 and Unicode 4.0 (maybe [:digit:] too). I'll take a look this weekend.
          Hide
          Uwe Schindler added a comment -

          I'm actually surprised that the DFAs are identical, since I'm almost certain that the set of characters matching [:letter:] changed between Unicode 3.0 and Unicode 4.0 (maybe [:digit:] too). I'll take a look this weekend.

          Because of that we have the patch: We now have two flex files, one with %unicode 3.0, which produces the same DFA as the old flex file when processed with Java 1.4 (as it was in Lucene 2.x). This is used for backwards compatibiulity (using the matchVersion parameter of ctor).

          For later Lucene versions we will have a new jflex file (currently unicode 4.0) and that produces the same matrix as java 1.5 in jflex 1.4 (at the moment).

          By that we simply made the parser regeneration invariant to the developer's JVM. About nothing more is this issue at the moment.

          Show
          Uwe Schindler added a comment - I'm actually surprised that the DFAs are identical, since I'm almost certain that the set of characters matching [:letter:] changed between Unicode 3.0 and Unicode 4.0 (maybe [:digit:] too). I'll take a look this weekend. Because of that we have the patch: We now have two flex files, one with %unicode 3.0, which produces the same DFA as the old flex file when processed with Java 1.4 (as it was in Lucene 2.x). This is used for backwards compatibiulity (using the matchVersion parameter of ctor). For later Lucene versions we will have a new jflex file (currently unicode 4.0) and that produces the same matrix as java 1.5 in jflex 1.4 (at the moment). By that we simply made the parser regeneration invariant to the developer's JVM. About nothing more is this issue at the moment.
          Hide
          Steve Rowe added a comment -

          Thanks, Uwe, that makes sense. My bad, I only skimmed the patch, and misunderstood "3.0" in one of the new files to refer to the Lucene version, not the Unicode version.

          Show
          Steve Rowe added a comment - Thanks, Uwe, that makes sense. My bad, I only skimmed the patch, and misunderstood "3.0" in one of the new files to refer to the Lucene version, not the Unicode version.
          Hide
          Robert Muir added a comment -

          Uwe, given Steven's comment above, I think we should move forward with this issue and flex 1.5?

          Show
          Robert Muir added a comment - Uwe, given Steven's comment above, I think we should move forward with this issue and flex 1.5?
          Hide
          Uwe Schindler added a comment -

          I will update the patch (using TEST_VERSION and so on) later and then we can proceed.

          Show
          Uwe Schindler added a comment - I will update the patch (using TEST_VERSION and so on) later and then we can proceed.
          Hide
          Uwe Schindler added a comment -

          Here updated patch, svn copy/move before apply as mentioned above.

          Show
          Uwe Schindler added a comment - Here updated patch, svn copy/move before apply as mentioned above.
          Hide
          Uwe Schindler added a comment -

          As requested on the mailing list, I will look into resetting the zzBuffer on Tokenizer.reset(Reader).

          Show
          Uwe Schindler added a comment - As requested on the mailing list, I will look into resetting the zzBuffer on Tokenizer.reset(Reader).
          Hide
          Shai Erera added a comment -

          Uwe, must this be coupled with that issue? This one waits for a long time (why? for JFlex 1.5 release?) and protecting against a huge buffer allocation can be a real quick and tiny fix. And this one also focuses on getting Unicode 5 to work, which is unrelated to the buffer size. But the buffer size is not a critical issue either that we need to move fast with it ... so it's your call. Just thought they are two unrelated problems.

          Show
          Shai Erera added a comment - Uwe, must this be coupled with that issue? This one waits for a long time (why? for JFlex 1.5 release?) and protecting against a huge buffer allocation can be a real quick and tiny fix. And this one also focuses on getting Unicode 5 to work, which is unrelated to the buffer size. But the buffer size is not a critical issue either that we need to move fast with it ... so it's your call. Just thought they are two unrelated problems.
          Hide
          Uwe Schindler added a comment -

          I plan to commit this soon! So any patch will get outdated, thats why i want to fix this here. And as this patch removes direct access from the Tokenizer to the lexer (as it is only accessible through an interface now), we have to change the jflex file to do it "correctly".

          Show
          Uwe Schindler added a comment - I plan to commit this soon! So any patch will get outdated, thats why i want to fix this here. And as this patch removes direct access from the Tokenizer to the lexer (as it is only accessible through an interface now), we have to change the jflex file to do it "correctly".
          Hide
          Shai Erera added a comment -

          I plan to commit this soon!

          That's great news !

          BTW - what are you going to do w/ the JFlex 1.5 binary? Are you going to check it in somewhere? because it hasn't been released last I checked. I'm asking for general knowledge, because I know the scripts are downloading it, or rely on it to exist somewhere.

          In that case, then yes, let's fix it here.

          Show
          Shai Erera added a comment - I plan to commit this soon! That's great news ! BTW - what are you going to do w/ the JFlex 1.5 binary? Are you going to check it in somewhere? because it hasn't been released last I checked. I'm asking for general knowledge, because I know the scripts are downloading it, or rely on it to exist somewhere. In that case, then yes, let's fix it here.
          Hide
          Uwe Schindler added a comment -

          You dont need the jflex binaries in general, only if you reconstruct the source files (using "ant jflex"). And its easy to generate, check out and start "mvn install".

          Show
          Uwe Schindler added a comment - You dont need the jflex binaries in general, only if you reconstruct the source files (using "ant jflex"). And its easy to generate, check out and start "mvn install".
          Hide
          Uwe Schindler added a comment -

          Here a new patch, with the zzBuffer reset to default implemented in a separate reset(Reader) method. As yyReset is generated as final, I had to change the name.

          Before apply, run:

          svn copy StandardTokenizerImpl.* to StandardTokenizerImplOrig.* 
          svn move StandardTokenizerImpl.* to StandardTokenizerImpl31.* 
          

          I will commit this in a day or two!

          Show
          Uwe Schindler added a comment - Here a new patch, with the zzBuffer reset to default implemented in a separate reset(Reader) method. As yyReset is generated as final, I had to change the name. Before apply, run: svn copy StandardTokenizerImpl.* to StandardTokenizerImplOrig.* svn move StandardTokenizerImpl.* to StandardTokenizerImpl31.* I will commit this in a day or two!
          Hide
          Uwe Schindler added a comment -

          Updated also the error message about missing jflex when calling "ant jflex" to regenerate the lexers. The message now contains instructions for downloading and building JFlex. Also add CHANGES.txt.

          Show
          Uwe Schindler added a comment - Updated also the error message about missing jflex when calling "ant jflex" to regenerate the lexers. The message now contains instructions for downloading and building JFlex. Also add CHANGES.txt.
          Hide
          Mark Miller added a comment -

          Uwe, must this be coupled with that issue? This one waits for a long time (why? for JFlex 1.5 release?) and protecting against a huge buffer allocation can be a real quick and tiny fix. And this one also focuses on getting Unicode 5 to work, which is unrelated to the buffer size. But the buffer size is not a critical issue either that we need to move fast with it ... so it's your call. Just thought they are two unrelated problems.

          Agreed. Whether its fixed as part of this commit or not, it really deserves its own issue anyway, for changes and tracking. It has nothing to do with this issue other than convenience.

          Show
          Mark Miller added a comment - Uwe, must this be coupled with that issue? This one waits for a long time (why? for JFlex 1.5 release?) and protecting against a huge buffer allocation can be a real quick and tiny fix. And this one also focuses on getting Unicode 5 to work, which is unrelated to the buffer size. But the buffer size is not a critical issue either that we need to move fast with it ... so it's your call. Just thought they are two unrelated problems. Agreed. Whether its fixed as part of this commit or not, it really deserves its own issue anyway, for changes and tracking. It has nothing to do with this issue other than convenience.
          Hide
          Uwe Schindler added a comment -

          Created sub-issue: LUCENE-2384

          Show
          Uwe Schindler added a comment - Created sub-issue: LUCENE-2384
          Hide
          Uwe Schindler added a comment -

          New patch with replacement of deprecated TermAttribute -> CharTermAttribute. It also fixes the reset()/reset(Reader) methods to be conform to all other Tokenizers and the documentations. The current one was resetting multiple times. This has no effect on backwards. Also improve the JFlex classpath detection to work with svn checkouts or future release zips.

          I will commit this soon when all tests ran.

          Show
          Uwe Schindler added a comment - New patch with replacement of deprecated TermAttribute -> CharTermAttribute. It also fixes the reset()/reset(Reader) methods to be conform to all other Tokenizers and the documentations. The current one was resetting multiple times. This has no effect on backwards. Also improve the JFlex classpath detection to work with svn checkouts or future release zips. I will commit this soon when all tests ran.
          Hide
          Uwe Schindler added a comment -

          Committed revision: 932163

          Show
          Uwe Schindler added a comment - Committed revision: 932163
          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development