Lucene - Core
  2. LUCENE-1068

Invalid behavior of StandardTokenizerImpl


    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3
    • Component/s: modules/analysis
    • Labels:
    • Lucene Fields:
      Patch Available


      The following code prints the output of StandardAnalyzer:

      Analyzer analyzer = new StandardAnalyzer();
      TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
      Token t;
      // Print every token the analyzer produces
      while ((t = ts.next()) != null) {
        System.out.println(t);
      }

      If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>), which is correct in my opinion.
      However, if you pass "www.abc.com." (notice the extra '.' at the end), the output is (wwwabccom,0,12,type=<ACRONYM>).

      I think the behavior in the second case is incorrect for several reasons:
      1. It tokenizes the string incorrectly (no argument there).
      2. It effectively prevents a URL from appearing at the end of a sentence, which is perfectly legal.
      3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. and not ABC.DEF.

      I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
      // acronyms: U.S.A., I.B.M., etc.
      // use a post-filter to remove dots
      ACRONYM = {ALPHA} "." ({ALPHA} ".")+


      Note that the comment describes acronyms of the form U.S.A. and I.B.M., that is, single letters separated by dots, and nothing else. I changed the definition to
      ACRONYM = {LETTER} "." ({LETTER} ".")+

      and it solved the problem.
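
      The difference between the two rules can be sketched with plain java.util.regex. This is only an approximation, not the JFlex grammar itself: it assumes the old rule has the form {ALPHA} "." ({ALPHA} ".")+ and the new one {LETTER} "." ({LETTER} ".")+, that ALPHA means one or more letters, and that LETTER means a single letter (approximated here by \p{L}). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: regex approximations of the old and new ACRONYM rules.
class AcronymRuleSketch {
    // Assumption: ALPHA is one or more letters, so each dotted segment may be a whole word.
    static final Pattern OLD_ACRONYM = Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");
    // Assumption: LETTER is exactly one letter, so each dotted segment is a single character.
    static final Pattern NEW_ACRONYM = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    static boolean oldRuleMatches(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    static boolean newRuleMatches(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"U.S.A.", "I.B.M.", "www.abc.com.", "www.abc.com"};
        for (String s : inputs) {
            System.out.println(s + " old=" + oldRuleMatches(s) + " new=" + newRuleMatches(s));
        }
    }
}
```

      Under these assumptions, both rules accept U.S.A. and I.B.M., but only the old rule accepts a host name followed by a trailing dot, which is consistent with the <ACRONYM> type reported above.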

      This was also reported here:

      1. standardTokenizerImpl.jflex.patch
        0.7 kB
        Shai Erera
      2. standardTokenizerImpl.patch
        7 kB
        Shai Erera
      3. StandardTokenizerImpl-2.patch
        12 kB
        Shai Erera
      4. StandardTokenizerImpl-3.patch
        15 kB
        Shai Erera
      5. StandardTokenizer-java-4.patch
        14 kB
        Shai Erera
      6. StandardTokenizer-test-4.patch
        2 kB
        Shai Erera
      7. StandardTokenizerImpl-5.patch
        16 kB
        Shai Erera
      8. LUCENE-1068.patch
        22 kB
        Grant Ingersoll

            • Assignee:
              Grant Ingersoll
              Shai Erera
            • Votes:
              1
            • Watchers:
              2


              • Created: