  1. Lucene - Core
  2. LUCENE-1150

The token types of the standard tokenizer are not accessible

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.3.2, 2.4
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Because StandardTokenizerImpl is not public, these token types are not accessible:

      public static final int ALPHANUM          = 0;
      public static final int APOSTROPHE        = 1;
      public static final int ACRONYM           = 2;
      public static final int COMPANY           = 3;
      public static final int EMAIL             = 4;
      public static final int HOST              = 5;
      public static final int NUM               = 6;
      public static final int CJ                = 7;
      /**
       * @deprecated this solves a bug where HOSTs that end with '.' are identified
       *             as ACRONYMs. It is deprecated and will be removed in the next
       *             release.
       */
      public static final int ACRONYM_DEP       = 8;
      
      public static final String [] TOKEN_TYPES = new String [] {
          "<ALPHANUM>",
          "<APOSTROPHE>",
          "<ACRONYM>",
          "<COMPANY>",
          "<EMAIL>",
          "<HOST>",
          "<NUM>",
          "<CJ>",
          "<ACRONYM_DEP>"
      };
      

      So no custom TokenFilter can be based on the token type. In fact, even the StandardFilter could not be written outside the org.apache.lucene.analysis.standard package.
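To make the use case concrete, here is a minimal sketch of filtering by token type. It does not use Lucene's classes: `TypeFilterSketch`, `Tok`, and `keepType` are hypothetical stand-ins, and only the type strings are taken from the list above. The point is that the comparison needs the type strings to be reachable from outside the tokenizer's package.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; none of these names are Lucene classes.
final class TypeFilterSketch {

    // Two of the type strings listed in the description above.
    static final String HOST_TYPE = "<HOST>";
    static final String ACRONYM_TYPE = "<ACRONYM>";

    // Minimal token: its text plus the type string a tokenizer assigned to it.
    static final class Tok {
        final String text;
        final String type;

        Tok(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    // Keeps only the tokens whose type equals the wanted type string.
    static List<Tok> keepType(List<Tok> tokens, String wantedType) {
        List<Tok> kept = new ArrayList<>();
        for (Tok t : tokens) {
            if (wantedType.equals(t.type)) {
                kept.add(t);
            }
        }
        return kept;
    }
}
```

A real filter would compare against the value returned by Token.type() in the same way, which is why the constants have to be public.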

      1. LUCENE-1150.take2.patch
        14 kB
        Michael McCandless
      2. LUCENE-1150.patch
        7 kB
        Michael McCandless

        Activity

        mikemccand Michael McCandless added a comment -

        You're right. I'll put that back as well, and port to 2.3.2.

        adb Antony Bowesman added a comment -

        The original tokenImage String array from 2.2 is still not available in this patch; the strings are still in the Impl. These are the values returned from Token.type(), so should they not be visible as well as the static ints?

        mikemccand Michael McCandless added a comment -

        Backported fix to 2.3.2.

        mikemccand Michael McCandless added a comment -

        I just committed this. Thanks for opening this Nicolas!

        mikemccand Michael McCandless added a comment -

        New patch attached, that also exposes the token types for WikipediaTokenizer. I'll commit in a day or two.

        mikemccand Michael McCandless added a comment -

        Good! I'll take that approach, and update WikipediaTokenizer too.

        gsingers Grant Ingersoll added a comment -

        Why not just add them on to the StandardTokenizer class?

        For the WikipediaTokenizer (roughly based on the StandardTokenizer), I just added them to the WikipediaTokenizer wrapper class. However, I did leave the StandardTokenizer ones as they were. So, we should probably do the appropriate thing there, too.
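The suggestion above, putting the constants on the wrapper class, can be sketched as follows. This is an illustration under assumptions, not the committed patch: `WrapperSketch` and its nested `Impl` are hypothetical names, with the private nested class standing in for a generated scanner that callers cannot see.

```java
// Hypothetical wrapper; the private nested class stands in for a hidden implementation.
final class WrapperSketch {

    // Not reachable from outside WrapperSketch, like a package-private generated class.
    private static final class Impl {
        static final int ALPHANUM = 0;
        static final int HOST = 5;
        static final String[] TOKEN_TYPES = {
            "<ALPHANUM>", "<APOSTROPHE>", "<ACRONYM>", "<COMPANY>", "<EMAIL>", "<HOST>"
        };
    }

    // Public re-exports: callers use these and never name Impl.
    public static final int ALPHANUM = Impl.ALPHANUM;
    public static final int HOST = Impl.HOST;

    // Copy the array so callers cannot modify the implementation's own table.
    public static final String[] TOKEN_TYPES = Impl.TOKEN_TYPES.clone();
}
```

Cloning the array is a design choice of this sketch; sharing the same array reference would also work but lets callers overwrite the implementation's strings.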

        mikemccand Michael McCandless added a comment -

        Attached patch fixing this. I just added a new Constants.java that has static constants defined, and added a compile-time testcase to assert that these constants remain publicly accessible.

        I will commit in a day or two.
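The test described above is a compile-time one: code outside the package references the constants, so the build breaks if they stop being public. To keep a sketch in a single file, the example below swaps in a runtime reflection check instead, which is a different technique with the same goal. `ConstantsVisibilitySketch`, `TypeConstants`, and `PackagePrivateConstants` are hypothetical names, not classes from the patch.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

final class ConstantsVisibilitySketch {

    // Hypothetical constants holder whose fields are all public.
    public static final class TypeConstants {
        public static final int ALPHANUM = 0;
        public static final int HOST = 5;
    }

    // Hypothetical holder that lost its "public" modifiers, the regression to catch.
    static final class PackagePrivateConstants {
        static final int HOST = 5;
    }

    // Returns true when every declared field is public, static and final.
    static boolean allPublicConstants(Class<?> holder) {
        for (Field f : holder.getDeclaredFields()) {
            if (f.isSynthetic()) {
                continue; // skip compiler- or tool-generated fields
            }
            int m = f.getModifiers();
            if (!(Modifier.isPublic(m) && Modifier.isStatic(m) && Modifier.isFinal(m))) {
                return false;
            }
        }
        return true;
    }
}
```

A compile-time test has the advantage of failing the build itself; the reflection version only fails when the test is run.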

        mikemccand Michael McCandless added a comment -

        Ugh, I missed that we lost this when we switched to JFlex (LUCENE-966). I'll take this.

        hibou Nicolas Lalevée added a comment -

        Here is my workaround until it is fixed in the Lucene API:

        // Declared in Lucene's own package so that it can reach the
        // non-public StandardTokenizerImpl.
        package org.apache.lucene.analysis.standard;
        
        // Re-exports selected token type strings as public constants.
        public class TokenTypeAccessor {
        
            public static final String APOSTROPHE_TYPE = StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.APOSTROPHE];
        
            public static final String ACRONYM_TYPE = StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.ACRONYM];
        
            public static final String HOST_TYPE = StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.HOST];
        
        }
        

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            hibou Nicolas Lalevée
          • Votes:
            0
            Watchers:
            0
