The following code prints the output of StandardAnalyzer:
Analyzer analyzer = new StandardAnalyzer();
TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
Token t;
while ((t = ts.next()) != null) {
    System.out.println(t);  // each token prints as (text,start,end,type=<TYPE>)
}
If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) (which is correct in my opinion).
However, if you pass "www.abc.com." (notice the extra '.' at the end), the output is (wwwabccom,0,12,type=<ACRONYM>).
I think the behavior in the second case is incorrect for several reasons:
1. It classifies the string incorrectly (no argument there).
2. It effectively prevents you from putting a URL at the end of a sentence, which is perfectly legal.
3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. and not ABC.DEF.
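To make point 3 concrete, here is a minimal sketch using plain java.util.regex rather than the real tokenizer. Both patterns are hypothetical stand-ins that I made up for illustration; they are not the rules from StandardTokenizerImpl.jflex. The "loose" one accepts any dot-terminated run of alphanumeric segments, while the "strict" one accepts only single letters followed by dots.

```java
import java.util.regex.Pattern;

public class AcronymShapeSketch {
    // Hypothetical loose rule: two or more alphanumeric runs, each followed by a dot.
    static final Pattern LOOSE =
        Pattern.compile("[A-Za-z0-9]+\\.(?:[A-Za-z0-9]+\\.)+");

    // Hypothetical strict rule: two or more single letters, each followed by a dot.
    static final Pattern STRICT =
        Pattern.compile("[A-Za-z]\\.(?:[A-Za-z]\\.)+");

    static boolean looseAcronym(String s) {
        return LOOSE.matcher(s).matches();
    }

    static boolean strictAcronym(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"U.S.A.", "I.B.M.", "www.abc.com.", "www.abc.com"};
        for (String s : samples) {
            System.out.println(s + "  loose=" + looseAcronym(s)
                + "  strict=" + strictAcronym(s));
        }
    }
}
```

Under these assumed patterns, "www.abc.com." satisfies the loose rule but not the strict one, while "U.S.A." and "I.B.M." satisfy both. That is the distinction I would expect an ACRONYM rule to draw.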
I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
// acronyms: U.S.A., I.B.M., etc.
// use a post-filter to remove dots
Notice that the comment describes acronyms as U.S.A. or I.B.M., not anything else. I changed the definition to
and it solved the problem.
|Assignee||Grant Ingersoll [ gsingers ]|
|Fix Version/s||2.3 [ 12312531 ]|
|Lucene Fields||[Patch Available, New]||[Patch Available]|
|Priority||Major [ 3 ]||Minor [ 4 ]|
|Status||Open [ 1 ]||In Progress [ 3 ]|
|Status||In Progress [ 3 ]||Resolved [ 5 ]|
|Resolution||Fixed [ 1 ]|
|Status||Resolved [ 5 ]||Closed [ 6 ]|