
Key: LUCENE-1150
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Michael McCandless
Reporter: Nicolas Lalevée
Votes: 0
Watchers: 0
Lucene - Java

The token types of the standard tokenizer are not accessible

Created: 25/Jan/08 10:16 AM   Updated: 08/May/08 07:47 PM
Component/s: Analysis
Affects Version/s: 2.3
Fix Version/s: 2.3.2, 2.4

Time Tracking:
Not Specified

File Attachments:
LUCENE-1150.patch (7 kB, licensed for inclusion in ASF works) 2008-01-25 12:40 PM Michael McCandless
LUCENE-1150.take2.patch (14 kB, licensed for inclusion in ASF works) 2008-01-25 07:16 PM Michael McCandless

Lucene Fields: New
Resolution Date: 15/Apr/08 09:09 AM


Description
Because StandardTokenizerImpl is not public, these token types are not accessible:
public static final int ALPHANUM          = 0;
public static final int APOSTROPHE        = 1;
public static final int ACRONYM           = 2;
public static final int COMPANY           = 3;
public static final int EMAIL             = 4;
public static final int HOST              = 5;
public static final int NUM               = 6;
public static final int CJ                = 7;
/**
 * @deprecated this solves a bug where HOSTs that end with '.' are identified
 *             as ACRONYMs. It is deprecated and will be removed in the next
 *             release.
 */
public static final int ACRONYM_DEP       = 8;

public static final String [] TOKEN_TYPES = new String [] {
    "<ALPHANUM>",
    "<APOSTROPHE>",
    "<ACRONYM>",
    "<COMPANY>",
    "<EMAIL>",
    "<HOST>",
    "<NUM>",
    "<CJ>",
    "<ACRONYM_DEP>"
};

So no custom TokenFilter can be based on the token type. In fact, even the StandardFilter could not be written outside the org.apache.lucene.analysis.standard package.
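
To make the use case concrete, here is a minimal sketch of the kind of type-based filtering that needs these strings. It does not use the Lucene API: SimpleToken and TypeFilterSketch are hypothetical stand-ins invented for illustration, and only the type strings ("<ALPHANUM>", "<HOST>", "<EMAIL>") are taken from the TOKEN_TYPES array above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a token: just the text and its type string.
// This is NOT org.apache.lucene.analysis.Token.
final class SimpleToken {
    final String text;
    final String type;

    SimpleToken(String text, String type) {
        this.text = text;
        this.type = type;
    }
}

// Hypothetical filter that keeps only tokens whose type is in an allowed set.
final class TypeFilterSketch {
    // Type strings copied from the TOKEN_TYPES array quoted in the description.
    static final String ALPHANUM_TYPE = "<ALPHANUM>";
    static final String HOST_TYPE = "<HOST>";
    static final String EMAIL_TYPE = "<EMAIL>";

    // Returns the tokens whose type matches one of the allowed type strings,
    // preserving their original order.
    static List<SimpleToken> keepTypes(List<SimpleToken> tokens, String... allowed) {
        Set<String> allowedTypes = new HashSet<String>(Arrays.asList(allowed));
        List<SimpleToken> kept = new ArrayList<SimpleToken>();
        for (SimpleToken token : tokens) {
            if (allowedTypes.contains(token.type)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SimpleToken> tokens = Arrays.asList(
            new SimpleToken("lucene", ALPHANUM_TYPE),
            new SimpleToken("lucene.apache.org", HOST_TYPE),
            new SimpleToken("dev@lucene.apache.org", EMAIL_TYPE));
        // Keep only host names and e-mail addresses.
        for (SimpleToken token : keepTypes(tokens, HOST_TYPE, EMAIL_TYPE)) {
            System.out.println(token.text + " " + token.type);
        }
    }
}
```

A real filter would compare against constants exposed by the tokenizer rather than hard-coded copies of the strings, which is exactly what is not possible while they live in a non-public class.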



Nicolas Lalevée added a comment - 25/Jan/08 10:17 AM
Here is my workaround until it is fixed in the Lucene API:
package org.apache.lucene.analysis.standard;

public class TokenTypeAccessor {

    public static final String APOSTROPHE_TYPE = StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.APOSTROPHE];

    public static final String ACRONYM_TYPE = StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.ACRONYM];

    public static final String HOST_TYPE = StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.HOST];

}

Michael McCandless added a comment - 25/Jan/08 12:36 PM
Ugh, I missed that we lost this when we switched to JFlex (LUCENE-966). I'll take this.

Michael McCandless made changes - 25/Jan/08 12:36 PM
Field Original Value New Value
Assignee Michael McCandless [ mikemccand ]
Michael McCandless added a comment - 25/Jan/08 12:40 PM
Attached patch fixing this. I just added a new Constants.java that has static constants defined, and added a compile-time testcase to assert that these constants remain publicly accessible.

I will commit in a day or two.


Michael McCandless made changes - 25/Jan/08 12:40 PM
Attachment LUCENE-1150.patch [ 12374028 ]
Grant Ingersoll added a comment - 25/Jan/08 01:21 PM
Why not just add them on to the StandardTokenizer class?

For the WikipediaTokenizer (roughly based on the StandardTokenizer), I just added them to the WikipediaTokenizer wrapper class. However, I did leave the StandardTokenizer ones as they were. So, we should probably do the appropriate thing there, too.


Michael McCandless added a comment - 25/Jan/08 02:45 PM
Good! I'll take that approach, and update WikipediaTokenizer too.

Michael McCandless added a comment - 25/Jan/08 07:16 PM
New patch attached, that also exposes the token types for WikipediaTokenizer. I'll commit in a day or two.

Michael McCandless made changes - 25/Jan/08 07:16 PM
Attachment LUCENE-1150.take2.patch [ 12374074 ]
Repository Revision Date User Message
ASF #616248 Tue Jan 29 10:51:44 UTC 2008 mikemccand LUCENE-1150: make StandardAnalyzer tokenizer constants public again (public access was accidentally removed with LUCENE-966)
Files Changed
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java
MODIFY /lucene/java/trunk/contrib/wikipedia/src/java/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.java
MODIFY /lucene/java/trunk/contrib/wikipedia/src/java/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerImpl.java
MODIFY /lucene/java/trunk/src/test/org/apache/lucene/analysis/TestAnalyzers.java
MODIFY /lucene/java/trunk/CHANGES.txt
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/store/FSDirectory.java
MODIFY /lucene/java/trunk/contrib/wikipedia/src/java/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerImpl.jflex

Michael McCandless added a comment - 29/Jan/08 10:52 AM
I just committed this. Thanks for opening this Nicolas!

Michael McCandless made changes - 29/Jan/08 10:52 AM
Status Open [ 1 ] Resolved [ 5 ]
Resolution Fixed [ 1 ]
Fix Version/s 2.4 [ 12312681 ]
Michael McCandless added a comment - 09/Apr/08 09:29 AM
Backported fix to 2.3.2.

Michael McCandless made changes - 09/Apr/08 09:29 AM
Fix Version/s 2.3.2 [ 12313057 ]
Repository Revision Date User Message
ASF #646243 Wed Apr 09 09:31:37 UTC 2008 mikemccand LUCENE-1150: re-instate constants in StandardTokenizer
Files Changed
MODIFY /lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/store/FSDirectory.java
MODIFY /lucene/java/branches/lucene_2_3/src/test/org/apache/lucene/analysis/TestAnalyzers.java
MODIFY /lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
MODIFY /lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java
MODIFY /lucene/java/branches/lucene_2_3/contrib/wikipedia/src/java/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.java
MODIFY /lucene/java/branches/lucene_2_3/contrib/wikipedia/src/java/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerImpl.java
MODIFY /lucene/java/branches/lucene_2_3/contrib/wikipedia/src/java/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerImpl.jflex
MODIFY /lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
MODIFY /lucene/java/branches/lucene_2_3/CHANGES.txt

Antony Bowesman added a comment - 15/Apr/08 07:33 AM
The original tokenImage String array from 2.2 is still not available in this patch; the strings are still only in the Impl. These are the values returned from Token.type(), so should they not be visible as well as the static ints?
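
The relationship between the two can be sketched as follows. TokenTypesSketch is a hypothetical stand-in that only mirrors the constants quoted in the description (it is not the Lucene class), and indexOfType is a helper invented for this illustration. The int constants index into the string array, while code that inspects a token sees only the string.

```java
// Hypothetical stand-in mirroring the constants quoted in the issue
// description; this is NOT StandardTokenizer or StandardTokenizerImpl.
final class TokenTypesSketch {
    static final int ALPHANUM = 0;
    static final int APOSTROPHE = 1;
    static final int ACRONYM = 2;
    static final int COMPANY = 3;
    static final int EMAIL = 4;
    static final int HOST = 5;
    static final int NUM = 6;
    static final int CJ = 7;
    static final int ACRONYM_DEP = 8;

    // The string at each index is the type string for the int constant
    // with the same value.
    static final String[] TOKEN_TYPES = {
        "<ALPHANUM>", "<APOSTROPHE>", "<ACRONYM>", "<COMPANY>", "<EMAIL>",
        "<HOST>", "<NUM>", "<CJ>", "<ACRONYM_DEP>"
    };

    // Invented helper: maps a type string back to its int constant,
    // or returns -1 if the string is not a known type.
    static int indexOfType(String type) {
        for (int i = 0; i < TOKEN_TYPES.length; i++) {
            if (TOKEN_TYPES[i].equals(type)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // With only the ints public, a caller cannot obtain "<HOST>" without
        // hard-coding it; with the array public, it is a simple lookup.
        System.out.println(TOKEN_TYPES[HOST]);
        System.out.println(indexOfType("<NUM>"));
    }
}
```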

Michael McCandless added a comment - 15/Apr/08 08:17 AM
You're right. I'll put that back as well, and port to 2.3.2.

Michael McCandless made changes - 15/Apr/08 08:17 AM
Status Resolved [ 5 ] Reopened [ 4 ]
Resolution Fixed [ 1 ]
Repository Revision Date User Message
ASF #648183 Tue Apr 15 08:48:41 UTC 2008 mikemccand LUCENE-1150: put back public tokenImage/TOKEN_TYPES in StandardTokenizer and WikipediaTokenizer
Files Changed
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java
MODIFY /lucene/java/trunk/contrib/wikipedia/src/java/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.java
MODIFY /lucene/java/trunk/contrib/wikipedia/src/java/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerImpl.java
MODIFY /lucene/java/trunk/src/test/org/apache/lucene/analysis/TestAnalyzers.java
MODIFY /lucene/java/trunk/contrib/wikipedia/src/java/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerImpl.jflex

Michael McCandless made changes - 15/Apr/08 09:09 AM
Resolution Fixed [ 1 ]
Status Reopened [ 4 ] Resolved [ 5 ]
Repository Revision Date User Message
ASF #648188 Tue Apr 15 09:12:00 UTC 2008 mikemccand LUCENE-1150: put back public tokenImage/TOKEN_TYPES in StandardTokenizer and WikipediaTokenizer
Files Changed
MODIFY /lucene/java/branches/lucene_2_3/src/test/org/apache/lucene/analysis/TestAnalyzers.java
MODIFY /lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
MODIFY /lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java
MODIFY /lucene/java/branches/lucene_2_3/contrib/wikipedia/src/java/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.java
MODIFY /lucene/java/branches/lucene_2_3/contrib/wikipedia/src/java/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerImpl.java
MODIFY /lucene/java/branches/lucene_2_3/contrib/wikipedia/src/java/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerImpl.jflex
MODIFY /lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex

Michael Busch made changes - 08/May/08 07:47 PM
Status Resolved [ 5 ] Closed [ 6 ]