Lucene - Core
LUCENE-3361

port url+email tokenizer to standardtokenizerinterface (or similar)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      We should do this so that we can fix the LUCENE-3358 bug there and preserve backwards compatibility.
      We also want this mechanism anyway, for upgrading to new Unicode versions in the future.

      We can regenerate the TLD list for 3.4, but we should ensure the existing one is used for the urlemail33 (or whatever it ends up being called),
      so that it's exactly the same.

      1. LUCENE-3361.patch
        635 kB
        Robert Muir

          Activity

          Robert Muir added a comment -

          Attached is a patch. Before applying it, you must move UAX29URLEmailTokenizer.jflex to UAX29URLEmailTokenizerImpl.jflex.

          • ports this tokenizer over to StandardTokenizerInterface
          • fixes the LUCENE-3358 bug
          • regenerates TLDs for trunk only
          • adds a backwards-compatible 3.1 version with the bug and the old TLDs, plus some basic tests
          • adds new ctors that require a version and deprecates the version-less ones
          • deprecates the InputStream ctor that uses the default charset
          • reorganizes constants like StandardTokenizer does and deprecates the old ones
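
          The version-based selection described in the list above can be sketched roughly as follows. This is only an illustration of the pattern, not the code in the patch: every name here (VersionedTokenizerSketch, CompatVersion, ScannerImpl, LegacyScanner, CurrentScanner) is hypothetical, and the "bug" in the legacy scanner is an invented toy difference, not the LUCENE-3358 behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names come from Lucene or from the patch.
class VersionedTokenizerSketch {

    /** Stand-in for a compatibility version constant. */
    enum CompatVersion { V3_1, V3_4 }

    /** Common interface that every scanner implementation satisfies. */
    interface ScannerImpl {
        List<String> scan(String text);
    }

    /** Frozen copy of the old behavior, kept unchanged for backwards compatibility. */
    static final class LegacyScanner implements ScannerImpl {
        @Override
        public List<String> scan(String text) {
            // Toy "bug": a trailing period stays attached to the token.
            return Arrays.asList(text.trim().split("\\s+"));
        }
    }

    /** Current behavior with the toy bug fixed. */
    static final class CurrentScanner implements ScannerImpl {
        @Override
        public List<String> scan(String text) {
            List<String> out = new ArrayList<>();
            for (String t : text.trim().split("\\s+")) {
                out.add(t.endsWith(".") ? t.substring(0, t.length() - 1) : t);
            }
            return out;
        }
    }

    private final ScannerImpl impl;

    /** The constructor requires a version, which picks the implementation once. */
    VersionedTokenizerSketch(CompatVersion matchVersion) {
        this.impl = matchVersion.compareTo(CompatVersion.V3_4) >= 0
                ? new CurrentScanner()
                : new LegacyScanner();
    }

    List<String> tokenize(String text) {
        return impl.scan(text);
    }

    public static void main(String[] args) {
        System.out.println(new VersionedTokenizerSketch(CompatVersion.V3_1).tokenize("mail me."));
        System.out.println(new VersionedTokenizerSketch(CompatVersion.V3_4).tokenize("mail me."));
    }
}
```

          The point of the shared interface is that callers asking for the old version keep getting the old tokens, while the fixed scanner can evolve separately.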
          Robert Muir added a comment -

          By the way, the patch is for trunk but includes all the deprecations, including the API ones. These can be removed in trunk immediately after backporting,
          but I would prefer to do that as a separate step, just so I don't forget anything.

          Steve Rowe added a comment -

          The jflex target depends on the clean-jflex target, which deletes all src/.../standard/*.java files whose contents match regex /generated.*by.*JFlex/. Your patch leaves intact the first line of UAX29URLEmailTokenizer.java, which matches the regex in a comment. As a result, running ant jflex deletes UAX29URLEmailTokenizer.java, and since it's no longer generated by JFlex, compilation fails.

          When I remove this JFlex comment line from UAX29URLEmailTokenizer.java, ant jflex works, everything compiles, and all tests succeed. +1 to commit after removing this line.
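
          To see why the leftover comment line matters, here is a minimal sketch of the content check described above. It assumes the clean-up step searches file contents for the quoted regex anywhere on a line; the class and method names (CleanJflexSketch, wouldBeDeleted) and the sample headers are hypothetical, and the real check lives in the Ant build, not in Java.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the "is this file JFlex-generated?" check.
class CleanJflexSketch {

    // The regex quoted in the comment above. '.' does not match newlines by
    // default, so all three words must appear on the same line.
    static final Pattern GENERATED = Pattern.compile("generated.*by.*JFlex");

    /** Returns true if the contents contain a line matching the regex. */
    static boolean wouldBeDeleted(String fileContents) {
        return GENERATED.matcher(fileContents).find();
    }

    public static void main(String[] args) {
        // Illustrative file contents, not copied from the Lucene sources.
        String withHeader = "/* The following code was generated by JFlex */\nclass A {}";
        String handWritten = "/** Hand-written tokenizer. */\nclass B {}";

        System.out.println(wouldBeDeleted(withHeader));   // true
        System.out.println(wouldBeDeleted(handWritten));  // false
    }
}
```

          A hand-written file that still carries the generated-by header looks identical to a generated one under this check, which is why removing the comment line fixes the build.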

          Steve Rowe added a comment -

          One other minor issue: ant clean-jflex doesn't remove the JFlex-generated *.java files under the new directory src/.../standard/std31/.

          To include them, on line #92 in modules/analysis/common/build.xml, change includes="*.java" to includes="**/*.java".
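
          The difference between the two include patterns can be sketched with Java NIO globs. This is an approximation, not Ant itself: in Ant, "**/" also matches zero directories, whereas the NIO glob "**/*.java" needs at least one directory separator, so the recursive helper below checks both patterns. The class name, helper names, and file names are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical sketch comparing a flat include pattern with a recursive one.
class IncludePatternSketch {

    // "*" stays within one path component; "**" may cross directory boundaries.
    static final PathMatcher FLAT =
            FileSystems.getDefault().getPathMatcher("glob:*.java");
    static final PathMatcher NESTED =
            FileSystems.getDefault().getPathMatcher("glob:**/*.java");

    /** Approximates includes="*.java": only files directly in the base directory. */
    static boolean flatMatch(String relativePath) {
        return FLAT.matches(Paths.get(relativePath));
    }

    /** Approximates Ant's includes="**" + "/*.java": files at any depth, including depth zero. */
    static boolean recursiveMatch(String relativePath) {
        Path p = Paths.get(relativePath);
        return FLAT.matches(p) || NESTED.matches(p);
    }

    public static void main(String[] args) {
        System.out.println(flatMatch("Impl.java"));            // true
        System.out.println(flatMatch("std31/Impl.java"));      // false: the subdirectory is skipped
        System.out.println(recursiveMatch("std31/Impl.java")); // true
    }
}
```

          With the flat pattern, generated files in a subdirectory such as std31/ are never considered, which matches the behavior reported above.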

          Robert Muir added a comment -

          good catch, thanks for reviewing and finding these issues!


            People

            • Assignee:
              Robert Muir
              Reporter:
              Robert Muir