[LUCENE-1068] Invalid behavior of StandardTokenizerImpl - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.3
Component/s: modules/analysis
Labels: None
Lucene Fields: Patch Available

Description

The following code prints the output of StandardAnalyzer:

Analyzer analyzer = new StandardAnalyzer();
TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
Token t;
while ((t = ts.next()) != null) {
    System.out.println(t);
}

If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) (which is correct in my opinion).
However, if you pass "www.abc.com." (notice the extra '.' at the end), the output is (wwwabccom,0,12,type=<ACRONYM>).

I think the behavior in the second case is incorrect for several reasons:
1. It recognizes the string incorrectly (no argument there).
2. It effectively prevents you from putting URLs at the end of a sentence, which is perfectly legal.
3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. and not ABC.DEF.
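
To see where the term text wwwabccom comes from: once a token is typed <ACRONYM>, its dots are stripped by a post-filter (the grammar comment quoted in this report mentions one). The snippet below is only a minimal sketch of that step; the class and method names are hypothetical stand-ins, not Lucene's own code.

```java
public class AcronymDotStripDemo {
    // Hypothetical stand-in for the post-filter step: remove every '.' from the term text.
    static String stripDots(String term) {
        return term.replace(".", "");
    }

    public static void main(String[] args) {
        // A real acronym collapses to its letters.
        System.out.println(stripDots("U.S.A."));       // prints USA
        // A host name mis-typed as ACRONYM loses its dots as well.
        System.out.println(stripDots("www.abc.com.")); // prints wwwabccom
    }
}
```

This matches the reported token: the term is 9 characters long while the offsets (0,12) still span the original 12-character input.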

I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
// acronyms: U.S.A., I.B.M., etc.
// use a post-filter to remove dots
ACRONYM = {ALPHA} "." ({ALPHA} ".")+

Notice that the comment describes acronyms of the form U.S.A. and I.B.M., that is, single letters separated by dots, and nothing else. I changed the definition to
ACRONYM = {LETTER} "." ({LETTER} ".")+
and it solved the problem.
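
The difference between the two definitions can be approximated with java.util.regex. This is a rough sketch under two assumptions: that ALPHA expands to one or more LETTERs, and that LETTER can be stood in for by \p{L} (the grammar lists explicit character ranges instead). It also matches whole strings rather than reproducing JFlex's longest-match scanning, so it illustrates the rule shapes, not the tokenizer itself.

```java
import java.util.regex.Pattern;

public class AcronymRuleDemo {
    // Approximation: any Unicode letter stands in for the grammar's LETTER macro.
    static final String LETTER = "\\p{L}";
    // Assumption: ALPHA is one or more LETTERs.
    static final String ALPHA = LETTER + "+";

    // Old rule shape: {ALPHA} "." ({ALPHA} ".")+
    static final Pattern OLD_ACRONYM =
            Pattern.compile(ALPHA + "\\.(" + ALPHA + "\\.)+");
    // Proposed rule shape: {LETTER} "." ({LETTER} ".")+
    static final Pattern NEW_ACRONYM =
            Pattern.compile(LETTER + "\\.(" + LETTER + "\\.)+");

    static boolean matchesOld(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    static boolean matchesNew(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"U.S.A.", "I.B.M.", "www.abc.com.", "www.abc.com"};
        for (String s : inputs) {
            System.out.println(s + "  old=" + matchesOld(s) + "  new=" + matchesNew(s));
        }
    }
}
```

Under these assumptions, both patterns accept U.S.A. and I.B.M., only the old one accepts www.abc.com., and neither accepts www.abc.com (no trailing dot), which is consistent with the behavior described above.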

This was also reported here:
http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926

Attachments

File                               Date              Size    Author
LUCENE-1068.patch                  19/Dec/07 02:54   22 kB   Grant Ingersoll
standardTokenizerImpl.jflex.patch  27/Nov/07 11:55   0.7 kB  Shai Erera
standardTokenizerImpl.patch        27/Nov/07 11:56   7 kB    Shai Erera
StandardTokenizerImpl-2.patch      29/Nov/07 13:37   12 kB   Shai Erera
StandardTokenizerImpl-3.patch      30/Nov/07 06:53   15 kB   Shai Erera
StandardTokenizerImpl-5.patch      12/Dec/07 14:37   16 kB   Shai Erera
StandardTokenizer-java-4.patch     11/Dec/07 13:26   14 kB   Shai Erera
StandardTokenizer-test-4.patch     11/Dec/07 13:26   2 kB    Shai Erera

Issue Links

is related to

LUCENE-1373 Most of the contributed Analyzers suffer from invalid recognition of acronyms. (Resolved)
LUCENE-1140 NPE in StopFilter caused by StandardAnalyzer(boolean replaceInvalidAcronym) constructor (Closed)

relates to

LUCENE-1100 StandardTokenizer incorrectly types certain values (Closed)
LUCENE-1151 Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default (Closed)

People

Assignee: Grant Ingersoll
Reporter: Shai Erera
Votes: 1
Watchers: 1

Dates

Created: 27/Nov/07 11:54
Updated: 28/Aug/22 11:43
Resolved: 28/Dec/07 02:46