Lucene - Core / LUCENE-1068

Invalid behavior of StandardTokenizerImpl

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: Patch Available

    Description

      The following code prints the output of StandardAnalyzer:

      Analyzer analyzer = new StandardAnalyzer();
      TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
      Token t;
      while ((t = ts.next()) != null) {
        System.out.println(t);
      }

      If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) (which is correct in my opinion).
      However, if you pass "www.abc.com." (notice the extra '.' at the end), the output is (wwwabccom,0,12,type=<ACRONYM>).

      I think the behavior in the second case is incorrect for several reasons:
      1. It recognizes the string incorrectly (no argument there).
      2. It prevents you from putting a URL at the end of a sentence, which is perfectly legal.
      3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. and not ABC.DEF.
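
      To make the two outputs above easier to compare, here is a minimal, Lucene-free sketch of how they could arise. It assumes that a post-filter strips the dots from ACRONYM tokens (as the grammar comment quoted below suggests) while the offsets still cover the full matched text. The class and method names are hypothetical and are not part of Lucene.

```java
public class AcronymDotStripSketch {
    // Hypothetical helper: mimics a post-filter that drops '.' from an ACRONYM token's text.
    static String stripDots(String tokenText) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokenText.length(); i++) {
            char c = tokenText.charAt(i);
            if (c != '.') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Formats a token the way the report prints it: (text,start,end,type=<TYPE>).
    // The end offset is computed from the matched text, not from the stripped text.
    static String describe(String matched, int start, String type) {
        String text = type.equals("ACRONYM") ? stripDots(matched) : matched;
        int end = start + matched.length();
        return "(" + text + "," + start + "," + end + ",type=<" + type + ">)";
    }

    public static void main(String[] args) {
        System.out.println(describe("www.abc.com", 0, "HOST"));
        System.out.println(describe("www.abc.com.", 0, "ACRONYM"));
    }
}
```

      Under these assumptions, the trailing '.' is enough to move the whole host name into the ACRONYM rule, after which every dot is lost from the indexed term.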

      I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
      // acronyms: U.S.A., I.B.M., etc.
      // use a post-filter to remove dots
      ACRONYM = {ALPHA} "." ({ALPHA} ".")+

      Notice that the comment describes acronyms as U.S.A. or I.B.M., that is, single letters separated by dots, and nothing else. I changed the definition to
      ACRONYM = {LETTER} "." ({LETTER} ".")+
      and it solved the problem.
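
      The difference between the two definitions can be approximated with java.util.regex. This is only a sketch: it assumes that ALPHA stands for one or more letters and LETTER for exactly one letter, and it ignores how JFlex chooses between competing rules. The class name and patterns are illustrative and are not taken from the patch.

```java
import java.util.regex.Pattern;

public class AcronymRuleSketch {
    // Assumption: ALPHA = one or more letters, so the old rule accepts multi-letter segments.
    static final Pattern OLD_ACRONYM = Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");
    // Assumption: LETTER = exactly one letter, so the new rule accepts only forms like A.B.C.
    static final Pattern NEW_ACRONYM = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    static boolean matchesOld(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    static boolean matchesNew(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"U.S.A.", "I.B.M.", "www.abc.com.", "www.abc.com"};
        for (String s : inputs) {
            System.out.println(s + " old=" + matchesOld(s) + " new=" + matchesNew(s));
        }
    }
}
```

      In this approximation both patterns accept U.S.A. and I.B.M., but only the old one accepts "www.abc.com.", which is consistent with the behavior described above.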

      This was also reported here:
      http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
      http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926

      Attachments

        1. LUCENE-1068.patch
          22 kB
          Grant Ingersoll
        2. standardTokenizerImpl.jflex.patch
          0.7 kB
          Shai Erera
        3. standardTokenizerImpl.patch
          7 kB
          Shai Erera
        4. StandardTokenizerImpl-2.patch
          12 kB
          Shai Erera
        5. StandardTokenizerImpl-3.patch
          15 kB
          Shai Erera
        6. StandardTokenizerImpl-5.patch
          16 kB
          Shai Erera
        7. StandardTokenizer-java-4.patch
          14 kB
          Shai Erera
        8. StandardTokenizer-test-4.patch
          2 kB
          Shai Erera


            People

              Assignee: Grant Ingersoll (gsingers)
              Reporter: Shai Erera (shaie)