Lucene - Core
  1. Lucene - Core
  2. LUCENE-1068

Invalid behavior of StandardTokenizerImpl

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      The following code prints the output of StandardAnalyzer:

      Analyzer analyzer = new StandardAnalyzer();
      TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
      Token t;
      while ((t = ts.next()) != null)

      { System.out.println(t); }

      If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) (which is correct in my opinion).
      However, if you pass "www.abc.com." (notice the extra '.' at the end), the output is (wwwabccom,0,12,type=<ACRONYM>).

      I think the behavior in the second case is incorrect for several reasons:
      1. It recognizes the string incorrectly (no argue on that).
      2. It kind of prevents you from putting URLs at the end of a sentence, which is perfectly legal.
      3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. and not ABC.DEF.

      I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
      // acronyms: U.S.A., I.B.M., etc.
      // use a post-filter to remove dots
      ACRONYM =

      {ALPHA} "." ({ALPHA}

      ".")+

      Notice how the comment relates to acronym as U.S.A., I.B.M. and not something else. I changed the definition to
      ACRONYM =

      {LETTER} "." ({LETTER}

      ".")+
      and it solved the problem.

      This was also reported here:
      http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
      http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926

      1. standardTokenizerImpl.jflex.patch
        0.7 kB
        Shai Erera
      2. standardTokenizerImpl.patch
        7 kB
        Shai Erera
      3. StandardTokenizerImpl-2.patch
        12 kB
        Shai Erera
      4. StandardTokenizerImpl-3.patch
        15 kB
        Shai Erera
      5. StandardTokenizer-java-4.patch
        14 kB
        Shai Erera
      6. StandardTokenizer-test-4.patch
        2 kB
        Shai Erera
      7. StandardTokenizerImpl-5.patch
        16 kB
        Shai Erera
      8. LUCENE-1068.patch
        22 kB
        Grant Ingersoll

        Issue Links

          Activity

          Hide
          Shai Erera added a comment -

          This fixes the JFlex definition file. The change simply replaces:
          ACRONYM =

          {ALPHA} "." ({ALPHA}

          ".")+
          with
          ACRONYM =

          {LETTER} "." ({LETTER}

          ".")+

          Show
          Shai Erera added a comment - This fixes the JFlex definition file. The change simply replaces: ACRONYM = {ALPHA} "." ({ALPHA} ".")+ with ACRONYM = {LETTER} "." ({LETTER} ".")+
          Hide
          Shai Erera added a comment -

          This is the result of re-compiling the JFlex fixed file. Not sure how useful this patch is, but I'm attaching it anyway.

          Show
          Shai Erera added a comment - This is the result of re-compiling the JFlex fixed file. Not sure how useful this patch is, but I'm attaching it anyway.
          Hide
          Shai Erera added a comment -

          I've found a way to do it (I think):
          I've added a new type called ACRONYM_DEP that identifies the old ACRONYMs and fixed the current ACRONYM to identify proper ones.
          I also marked ACRONYM_DEP as deprecated.
          I added code to StandardTokenizer to set the type of a token to HOST if the type returned is ACRONYM_DEP. This behavior can be changed if you think the type should be set to ACRONYM, in case there are applications that count on the Token type.

          I wrote these 4 lines of code to verify it works:
          public static void main(String[] args) throws Exception

          { parse("www.abc.com."); parse("www.abc.com"); parse("I.B.M."); }

          public static void parse(String text) throws Exception {
          Analyzer analyzer = new StandardAnalyzer();
          TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
          Token t;
          while ((t = ts.next()) != null)

          { System.out.println(t); }

          }
          And the output is:
          (www.abc.com.,0,12,type=<HOST>)
          (www.abc.com,0,11,type=<HOST>)
          (ibm,0,6,type=<ACRONYM>)

          Show
          Shai Erera added a comment - I've found a way to do it (I think): I've added a new type called ACRONYM_DEP that identifies the old ACRONYMs and fixed the current ACRONYM to identify proper ones. I also marked ACRONYM_DEP as deprecated. I added code to StandardTokenizer to set the type of a token to HOST if the type returned is ACRONYM_DEP. This behavior can be changed if you think the type should be set to ACRONYM, in case there are applications that count on the Token type. I wrote these 4 lines of code to verify it works: public static void main(String[] args) throws Exception { parse("www.abc.com."); parse("www.abc.com"); parse("I.B.M."); } public static void parse(String text) throws Exception { Analyzer analyzer = new StandardAnalyzer(); TokenStream ts = analyzer.tokenStream("content", new StringReader(text)); Token t; while ((t = ts.next()) != null) { System.out.println(t); } } And the output is: (www.abc.com.,0,12,type=<HOST>) (www.abc.com,0,11,type=<HOST>) (ibm,0,6,type=<ACRONYM>)
          Hide
          Shai Erera added a comment -

          The previous patch I put was incorrect since it would still break existing applications. The current patch does:
          1. Introduces a new type ACRONYM_DEP which is deprecated and recognizes the old ACRONYM format.
          2. Fixes ACRONYM to recognize LETTER + "." (LETTER + ".")+.
          3. Added a public member to StandardTokenizer and StandardAnalyzer replaceDepAcronym which can be set if the application would like the deprecated acronym format to be treated as ACRONYM or HOST. The default behavior, if not set is to recognize the old ACRONYM as HOST.

          This is how it should be used:
          public static void main(String[] args) throws Exception

          { parse("www.abc.com.", false); parse("www.abc.com.", true); parse("www.abc.com", true); parse("I.B.M.", true); }

          public static void parse(String text, boolean replaceDepAcronym) throws Exception {
          StandardAnalyzer analyzer = new StandardAnalyzer();
          analyzer.replaceDepAcronym = replaceDepAcronym;
          TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
          Token t;
          while ((t = ts.next()) != null)

          { System.out.println(t); }

          }
          And here is the output:
          (wwwabccom,0,12,type=<ACRONYM>)
          (www.abc.com.,0,12,type=<HOST>)
          (www.abc.com,0,11,type=<HOST>)
          (ibm,0,6,type=<ACRONYM>)

          The member is marked deprecated so we can remove it in the next release. Applications that would like to new behavior need to do nothing, and therefore will not be impacted once we remove that member. Applications that want the old behavior need to explicitly set it and in the next major release remove it.

          I think that solves it. How should I proceed?

          Show
          Shai Erera added a comment - The previous patch I put was incorrect since it would still break existing applications. The current patch does: 1. Introduces a new type ACRONYM_DEP which is deprecated and recognizes the old ACRONYM format. 2. Fixes ACRONYM to recognize LETTER + "." (LETTER + ".")+. 3. Added a public member to StandardTokenizer and StandardAnalyzer replaceDepAcronym which can be set if the application would like the deprecated acronym format to be treated as ACRONYM or HOST. The default behavior, if not set is to recognize the old ACRONYM as HOST. This is how it should be used: public static void main(String[] args) throws Exception { parse("www.abc.com.", false); parse("www.abc.com.", true); parse("www.abc.com", true); parse("I.B.M.", true); } public static void parse(String text, boolean replaceDepAcronym) throws Exception { StandardAnalyzer analyzer = new StandardAnalyzer(); analyzer.replaceDepAcronym = replaceDepAcronym; TokenStream ts = analyzer.tokenStream("content", new StringReader(text)); Token t; while ((t = ts.next()) != null) { System.out.println(t); } } And here is the output: (wwwabccom,0,12,type=<ACRONYM>) (www.abc.com.,0,12,type=<HOST>) (www.abc.com,0,11,type=<HOST>) (ibm,0,6,type=<ACRONYM>) The member is marked deprecated so we can remove it in the next release. Applications that would like to new behavior need to do nothing, and therefore will not be impacted once we remove that member. Applications that want the old behavior need to explicitly set it and in the next major release remove it. I think that solves it. How should I proceed?
          Hide
          Grant Ingersoll added a comment -

          Hi Shai,

          Thanks for the patch. Can you please add unit tests in TestStandardAnalyzer?

          Also, if you run svn diff in the Lucene directory then it will generate a patch that doesn't need to be modified (your patch has references to D:/ etc.)

          Show
          Grant Ingersoll added a comment - Hi Shai, Thanks for the patch. Can you please add unit tests in TestStandardAnalyzer? Also, if you run svn diff in the Lucene directory then it will generate a patch that doesn't need to be modified (your patch has references to D:/ etc.)
          Hide
          Shai Erera added a comment -

          Hi Grant,

          I used Eclipse to generate the patch (right-click on
          org.apache.lucene.analysis.standard, select Team and Create Patch). How do I
          run svn diff? Can I do it from inside Eclipse or should I install SVN
          cmd-line tools?


          Regards,

          Shai Erera

          Show
          Shai Erera added a comment - Hi Grant, I used Eclipse to generate the patch (right-click on org.apache.lucene.analysis.standard, select Team and Create Patch). How do I run svn diff? Can I do it from inside Eclipse or should I install SVN cmd-line tools? – Regards, Shai Erera
          Hide
          Grant Ingersoll added a comment -

          Hmmm, maybe there is a way in Eclipse to make the path relative to the
          working directory? Otherwise, from the command line in the Lucene
          directory: svn diff > StandardTokenizer-4.patch

          -Grant

          --------------------------
          Grant Ingersoll
          http://lucene.grantingersoll.com

          Lucene Helpful Hints:
          http://wiki.apache.org/lucene-java/BasicsOfPerformance
          http://wiki.apache.org/lucene-java/LuceneFAQ

          Show
          Grant Ingersoll added a comment - Hmmm, maybe there is a way in Eclipse to make the path relative to the working directory? Otherwise, from the command line in the Lucene directory: svn diff > StandardTokenizer-4.patch -Grant -------------------------- Grant Ingersoll http://lucene.grantingersoll.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
          Hide
          Shai Erera added a comment -

          Code fies under java and test packages. This should be applied under "src"

          Show
          Shai Erera added a comment - Code fies under java and test packages. This should be applied under "src"
          Hide
          Michael Busch added a comment -

          The member is marked deprecated so we can remove it in the next release. Applications that would like to new behavior need to do nothing, and therefore will not be impacted once we remove that member. Applications that want the old behavior need to explicitly set it and in the next major release remove it.

          Doesn't this mean it is an API change if we make the new behavior the default? Apps that upgrade will see the new behavior unless they set they call replaceDepAcronym.

          To be fully backwards compatible I think this patch should use the old behavior as default. Then in 3.0 we can make the new behavior the default.

          Show
          Michael Busch added a comment - The member is marked deprecated so we can remove it in the next release. Applications that would like to new behavior need to do nothing, and therefore will not be impacted once we remove that member. Applications that want the old behavior need to explicitly set it and in the next major release remove it. Doesn't this mean it is an API change if we make the new behavior the default? Apps that upgrade will see the new behavior unless they set they call replaceDepAcronym. To be fully backwards compatible I think this patch should use the old behavior as default. Then in 3.0 we can make the new behavior the default.
          Hide
          Shai Erera added a comment -

          Changed the default behavior to match the current behavior. Applications that want to use the new definitions of HOST and ACRONYM should call StandardAnalyzer.replaceDepAcronym = true.

          Show
          Shai Erera added a comment - Changed the default behavior to match the current behavior. Applications that want to use the new definitions of HOST and ACRONYM should call StandardAnalyzer.replaceDepAcronym = true.
          Hide
          Grant Ingersoll added a comment -

          Applied patch. Updated some documentation. Changed it to use a private boolean along with getters and setters, plus added some new constructors. All of these should be deprecated and marked as being removed in 3.x.

          I will apply patch tomorrow or Friday unless I hear objections

          Show
          Grant Ingersoll added a comment - Applied patch. Updated some documentation. Changed it to use a private boolean along with getters and setters, plus added some new constructors. All of these should be deprecated and marked as being removed in 3.x. I will apply patch tomorrow or Friday unless I hear objections
          Hide
          Grant Ingersoll added a comment -

          StandardTokenizer also incorrectly marks numbers as HOST.

          For example, on line 108 of TestStandardAnalyzer, the type of 21.35 is HOST when I think it should be NUM.

          Show
          Grant Ingersoll added a comment - StandardTokenizer also incorrectly marks numbers as HOST. For example, on line 108 of TestStandardAnalyzer, the type of 21.35 is HOST when I think it should be NUM.
          Hide
          Shai Erera added a comment -

          Even if you run testNumeric() on the trunk version, it recognizes "21.35" as HOST and not NUM ... The problem is that HOST is configured to recognized letters or digits. I'll check if there's a way to define precedence in JFlex, i.e., first detect NUM, then HOST (as every floating number is a HOST).
          Another option would be to set HOST do detect series of xxx.yyy.(zzz .)+, meaning aaa.bbb won't be a HOST, but aaa.bbb.ccc will be. Do you see any problem with that? Are you aware of hosts that are of the form aa.bb?

          Show
          Shai Erera added a comment - Even if you run testNumeric() on the trunk version, it recognizes "21.35" as HOST and not NUM ... The problem is that HOST is configured to recognized letters or digits. I'll check if there's a way to define precedence in JFlex, i.e., first detect NUM, then HOST (as every floating number is a HOST). Another option would be to set HOST do detect series of xxx.yyy.(zzz .)+, meaning aaa.bbb won't be a HOST, but aaa.bbb.ccc will be. Do you see any problem with that? Are you aware of hosts that are of the form aa.bb?
          Hide
          Shai Erera added a comment -

          Maybe this is a separate issue?
          Notice that IP addresses are also recognized as HOST, however StandardTokenizerImpl.jflex documentation specifies they should be recognized as NUM.
          // floating point, serial, model numbers, ip addresses, etc.
          // every other segment must have at least one digit
          NUM = (

          {ALPHANUM} {P} {HAS_DIGIT}
          | {HAS_DIGIT} {P} {ALPHANUM}
          {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM}

          )+

          {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT}

          )+

          {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM}

          {P} {HAS_DIGIT})+
          | {HAS_DIGIT} {P}

          {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM}

          )+)

          Show
          Shai Erera added a comment - Maybe this is a separate issue? Notice that IP addresses are also recognized as HOST, however StandardTokenizerImpl.jflex documentation specifies they should be recognized as NUM. // floating point, serial, model numbers, ip addresses, etc. // every other segment must have at least one digit NUM = ( {ALPHANUM} {P} {HAS_DIGIT} | {HAS_DIGIT} {P} {ALPHANUM} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM} )+ {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT} )+ {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+ | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM} )+)
          Hide
          Grant Ingersoll added a comment -

          Let's commit this patch, and move the floating point issue to later.

          Show
          Grant Ingersoll added a comment - Let's commit this patch, and move the floating point issue to later.
          Hide
          Grant Ingersoll added a comment -

          Committed.

          Show
          Grant Ingersoll added a comment - Committed.
          Hide
          Mark Lassau added a comment -

          Causes JIRA issue JRA-15484.

          Show
          Mark Lassau added a comment - Causes JIRA issue JRA-15484 .

            People

            • Assignee:
              Grant Ingersoll
              Reporter:
              Shai Erera
            • Votes:
              1 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development