Issue Details (XML | Word | Printable)

Key: LUCENE-1068
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Grant Ingersoll
Reporter: Shai Erera
Votes: 1
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Invalid behavior of StandardTokenizerImpl

Created: 27/Nov/07 11:54 AM   Updated: 05/Sep/08 06:20 AM
Return to search
Component/s: Analysis
Affects Version/s: None
Fix Version/s: 2.3

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works LUCENE-1068.patch 2007-12-19 02:54 AM Grant Ingersoll 22 kB
Text File Licensed for inclusion in ASF works StandardTokenizer-java-4.patch 2007-12-11 01:26 PM Shai Erera 14 kB
Text File Licensed for inclusion in ASF works StandardTokenizer-test-4.patch 2007-12-11 01:26 PM Shai Erera 2 kB
Text File Licensed for inclusion in ASF works StandardTokenizerImpl-2.patch 2007-11-29 01:37 PM Shai Erera 12 kB
Text File Licensed for inclusion in ASF works StandardTokenizerImpl-3.patch 2007-11-30 06:53 AM Shai Erera 15 kB
Text File Licensed for inclusion in ASF works StandardTokenizerImpl-5.patch 2007-12-12 02:37 PM Shai Erera 16 kB
Text File Licensed for inclusion in ASF works standardTokenizerImpl.jflex.patch 2007-11-27 11:55 AM Shai Erera 0.7 kB
Text File Licensed for inclusion in ASF works standardTokenizerImpl.patch 2007-11-27 11:56 AM Shai Erera 7 kB
Issue Links:
Reference

Lucene Fields: Patch Available
Resolution Date: 28/Dec/07 02:46 AM


 Description  « Hide
The following code prints the output of StandardAnalyzer:

Analyzer analyzer = new StandardAnalyzer();
TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
Token t;
while ((t = ts.next()) != null) { System.out.println(t); }

If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) (which is correct in my opinion).
However, if you pass "www.abc.com." (notice the extra '.' at the end), the output is (wwwabccom,0,12,type=<ACRONYM>).

I think the behavior in the second case is incorrect for several reasons:
1. It recognizes the string incorrectly (no argue on that).
2. It kind of prevents you from putting URLs at the end of a sentence, which is perfectly legal.
3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. and not ABC.DEF.

I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
// acronyms: U.S.A., I.B.M., etc.
// use a post-filter to remove dots
ACRONYM = {ALPHA} "." ({ALPHA} ".")+

Notice how the comment relates to acronym as U.S.A., I.B.M. and not something else. I changed the definition to
ACRONYM = {LETTER} "." ({LETTER} ".")+
and it solved the problem.

This was also reported here:
http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Shai Erera added a comment - 27/Nov/07 11:55 AM
This fixes the JFlex definition file. The change simply replaces:
ACRONYM = {ALPHA} "." ({ALPHA} ".")+
with
ACRONYM = {LETTER} "." ({LETTER} ".")+

Shai Erera added a comment - 27/Nov/07 11:56 AM
This is the result of re-compiling the JFlex fixed file. Not sure how useful this patch is, but I'm attaching it anyway.

Shai Erera added a comment - 29/Nov/07 01:37 PM
I've found a way to do it (I think):
I've added a new type called ACRONYM_DEP that identifies the old ACRONYMs and fixed the current ACRONYM to identify proper ones.
I also marked ACRONYM_DEP as deprecated.
I added code to StandardTokenizer to set the type of a token to HOST if the type returned is ACRONYM_DEP. This behavior can be changed if you think the type should be set to ACRONYM, in case there are applications that count on the Token type.

I wrote these 4 lines of code to verify it works:
public static void main(String[] args) throws Exception { parse("www.abc.com."); parse("www.abc.com"); parse("I.B.M."); }

public static void parse(String text) throws Exception {
Analyzer analyzer = new StandardAnalyzer();
TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
Token t;
while ((t = ts.next()) != null) { System.out.println(t); }
}
And the output is:
(www.abc.com.,0,12,type=<HOST>)
(www.abc.com,0,11,type=<HOST>)
(ibm,0,6,type=<ACRONYM>)


Shai Erera added a comment - 30/Nov/07 06:53 AM
The previous patch I put was incorrect since it would still break existing applications. The current patch does:
1. Introduces a new type ACRONYM_DEP which is deprecated and recognizes the old ACRONYM format.
2. Fixes ACRONYM to recognize LETTER + "." (LETTER + ".")+.
3. Added a public member to StandardTokenizer and StandardAnalyzer replaceDepAcronym which can be set if the application would like the deprecated acronym format to be treated as ACRONYM or HOST. The default behavior, if not set is to recognize the old ACRONYM as HOST.

This is how it should be used:
public static void main(String[] args) throws Exception { parse("www.abc.com.", false); parse("www.abc.com.", true); parse("www.abc.com", true); parse("I.B.M.", true); }

public static void parse(String text, boolean replaceDepAcronym) throws Exception {
StandardAnalyzer analyzer = new StandardAnalyzer();
analyzer.replaceDepAcronym = replaceDepAcronym;
TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
Token t;
while ((t = ts.next()) != null) { System.out.println(t); }
}
And here is the output:
(wwwabccom,0,12,type=<ACRONYM>)
(www.abc.com.,0,12,type=<HOST>)
(www.abc.com,0,11,type=<HOST>)
(ibm,0,6,type=<ACRONYM>)

The member is marked deprecated so we can remove it in the next release. Applications that would like to new behavior need to do nothing, and therefore will not be impacted once we remove that member. Applications that want the old behavior need to explicitly set it and in the next major release remove it.

I think that solves it. How should I proceed?


Grant Ingersoll added a comment - 10/Dec/07 09:15 PM
Hi Shai,

Thanks for the patch. Can you please add unit tests in TestStandardAnalyzer?

Also, if you run svn diff in the Lucene directory then it will generate a patch that doesn't need to be modified (your patch has references to D:/ etc.)


Shai Erera added a comment - 10/Dec/07 09:27 PM
Hi Grant,

I used Eclipse to generate the patch (right-click on
org.apache.lucene.analysis.standard, select Team and Create Patch). How do I
run svn diff? Can I do it from inside Eclipse or should I install SVN
cmd-line tools?


Regards,

Shai Erera


Grant Ingersoll added a comment - 10/Dec/07 09:43 PM
Hmmm, maybe there is a way in Eclipse to make the path relative to the
working directory? Otherwise, from the command line in the Lucene
directory: svn diff > StandardTokenizer-4.patch

-Grant

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


Shai Erera added a comment - 11/Dec/07 01:26 PM
Code fies under java and test packages. This should be applied under "src"

Michael Busch added a comment - 12/Dec/07 12:23 PM

The member is marked deprecated so we can remove it in the next release. Applications that would like to new behavior need to do nothing, and therefore will not be impacted once we remove that member. Applications that want the old behavior need to explicitly set it and in the next major release remove it.

Doesn't this mean it is an API change if we make the new behavior the default? Apps that upgrade will see the new behavior unless they set they call replaceDepAcronym.

To be fully backwards compatible I think this patch should use the old behavior as default. Then in 3.0 we can make the new behavior the default.


Shai Erera added a comment - 12/Dec/07 02:37 PM
Changed the default behavior to match the current behavior. Applications that want to use the new definitions of HOST and ACRONYM should call StandardAnalyzer.replaceDepAcronym = true.

Grant Ingersoll added a comment - 19/Dec/07 02:54 AM
Applied patch. Updated some documentation. Changed it to use a private boolean along with getters and setters, plus added some new constructors. All of these should be deprecated and marked as being removed in 3.x.

I will apply patch tomorrow or Friday unless I hear objections


Grant Ingersoll added a comment - 21/Dec/07 05:44 PM
StandardTokenizer also incorrectly marks numbers as HOST.

For example, on line 108 of TestStandardAnalyzer, the type of 21.35 is HOST when I think it should be NUM.


Shai Erera added a comment - 23/Dec/07 07:18 AM
Even if you run testNumeric() on the trunk version, it recognizes "21.35" as HOST and not NUM ... The problem is that HOST is configured to recognized letters or digits. I'll check if there's a way to define precedence in JFlex, i.e., first detect NUM, then HOST (as every floating number is a HOST).
Another option would be to set HOST do detect series of xxx.yyy.(zzz .)+, meaning aaa.bbb won't be a HOST, but aaa.bbb.ccc will be. Do you see any problem with that? Are you aware of hosts that are of the form aa.bb?

Shai Erera added a comment - 23/Dec/07 07:36 AM
Maybe this is a separate issue?
Notice that IP addresses are also recognized as HOST, however StandardTokenizerImpl.jflex documentation specifies they should be recognized as NUM.
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM = ({ALPHANUM} {P} {HAS_DIGIT}
| {HAS_DIGIT} {P} {ALPHANUM}
{ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
{HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
{ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
| {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)

Grant Ingersoll added a comment - 26/Dec/07 12:56 PM
Let's commit this patch, and move the floating point issue to later.

Grant Ingersoll added a comment - 28/Dec/07 02:46 AM
Committed.

Mark Lassau added a comment - 03/Sep/08 12:59 AM
Causes JIRA issue JRA-15484.