Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: modules/analysis
    • Labels: None

      Description

      It is sometimes useful to have a more compact, easy-to-parse type representation for Token than the current type() String. This patch adds a BitSet to Token, defaulting to null, with accessors for setting bit flags on a Token. This is useful for communicating information about a token to TokenFilters further down the chain.

      For example, in the WikipediaTokenizer, a token could be both a category and bold (or many other combinations), yet it is difficult to communicate this without adding many different Strings for type. Unlike payload information (which could serve this purpose), the BitSet is not added to the index (although one could easily convert it to a payload).
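To make the "category and bold" case concrete, here is a minimal sketch using java.util.BitSet. The class name, the bit positions, and the helper methods are hypothetical and are not taken from the patch or from WikipediaTokenizer; the payload conversion is only one possible encoding.

```java
import java.util.BitSet;

/** Hypothetical sketch: marking one token with several independent type bits. */
final class TypeBitsSketch {
    // Bit positions are illustrative assumptions, not WikipediaTokenizer constants.
    static final int CATEGORY = 0;
    static final int BOLD = 1;
    static final int ITALICS = 2;

    /** Builds the bit set for a token that is both a category and bold. */
    static BitSet boldCategory() {
        BitSet bits = new BitSet();
        bits.set(CATEGORY);
        bits.set(BOLD);
        return bits;
    }

    /** Serializes the bits so they could be stored as payload bytes if desired. */
    static byte[] toPayloadBytes(BitSet bits) {
        return bits.toByteArray();
    }

    public static void main(String[] args) {
        BitSet bits = boldCategory();
        // A downstream filter can test each property independently.
        System.out.println("category and bold: " + (bits.get(CATEGORY) && bits.get(BOLD)));
        System.out.println("italics: " + bits.get(ITALICS));
        System.out.println("payload bytes: " + toPayloadBytes(bits).length);
    }
}
```

A single String type could express this only by inventing a combined name for every pairing, which is the problem the description points at.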

      1. LUCENE-1137.patch
        4 kB
        Grant Ingersoll
      2. LUCENE-1137.patch
        4 kB
        Grant Ingersoll
      3. LUCENE-1137.patch
        3 kB
        Grant Ingersoll

        Activity

        Grant Ingersoll added a comment -

        Added get/setTypeBits() method and underlying storage and constructors.

        Yonik Seeley added a comment -

        Gack! I recommended a bitset on Token previously, but I meant an elemental one... an int (32 bits) or a long (64 bits).
        Half of the bits could be reserved for use by Lucene tokenizers, and half could be reserved for users. I think an actual BitSet is too heavy-weight.

        Just provide an int or long Token.getFlags() and Token.setFlags(int or long), and nothing more (we don't need to do bit twiddling for users, IMO).
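A minimal sketch of what this suggestion could look like with an int. FlagsTokenSketch is a hypothetical stand-in for Token; the 16/16 split and the two flag names are assumptions chosen for illustration, since the comment only says "half" without fixing which bits go where.

```java
/** Hypothetical stand-in for Token, showing only the proposed flags accessors. */
final class FlagsTokenSketch {
    // Assumed split: low 16 bits reserved for Lucene tokenizers, high 16 bits for users.
    static final int LUCENE_BITS = 0x0000FFFF;
    static final int USER_BITS = 0xFFFF0000;

    // Illustrative flags; neither name comes from the patch.
    static final int LUCENE_KEYWORD = 1 << 0;
    static final int USER_HIGHLIGHT = 1 << 16;

    private int flags; // 0 means no flags set

    int getFlags() {
        return flags;
    }

    void setFlags(int flags) {
        this.flags = flags;
    }

    public static void main(String[] args) {
        FlagsTokenSketch token = new FlagsTokenSketch();
        // Callers do their own bit twiddling; the class offers no helpers.
        token.setFlags(token.getFlags() | LUCENE_KEYWORD | USER_HIGHLIGHT);
        System.out.println("user flag set: " + ((token.getFlags() & USER_HIGHLIGHT) != 0));
        System.out.println("lucene bits: " + Integer.toBinaryString(token.getFlags() & LUCENE_BITS));
    }
}
```

Compared with a BitSet, the field costs four bytes per token and needs no allocation, which is the "too heavy-weight" concern raised above.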

        Steve Rowe added a comment -

        I see two problems with this patch:

        1. Although in the patch you say that the "type bits" field added by the patch is completely separate from the String type information, you don't name them with sufficiently different names to distinguish them.

        2. The information encoded by BitSet is a set of <int,boolean> tuples. These are opaque values. In order for this to work, every tokenizer in the chain has to be aware of every other one's use of these. This makes sharing hard.

        At a minimum, there should be some way to declare who's using what bit for what purpose - maybe through a static hash table or something?
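One way to picture the "static hash table" idea is the sketch below. It is purely hypothetical: the class FlagRegistrySketch and its methods are invented for illustration, and nothing like it appears in the patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry recording which component uses which bit position. */
final class FlagRegistrySketch {
    // Maps a bit position to the name of the component that claimed it.
    private static final Map<Integer, String> OWNERS = new ConcurrentHashMap<>();

    /** Claims a bit for an owner; returns false if a different owner already holds it. */
    static boolean claim(int bit, String owner) {
        String previous = OWNERS.putIfAbsent(bit, owner);
        return previous == null || previous.equals(owner);
    }

    /** Returns the owner of a bit, or null if nobody has claimed it. */
    static String ownerOf(int bit) {
        return OWNERS.get(bit);
    }

    public static void main(String[] args) {
        System.out.println(claim(0, "WikipediaTokenizer"));
        System.out.println(claim(0, "MyCustomFilter"));
        System.out.println(ownerOf(0));
    }
}
```

A registry like this only documents ownership. It does not stop a filter from setting a bit it never claimed, so the components in a chain would still have to cooperate.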

        Grant Ingersoll added a comment -

        "The information encoded by BitSet is a set of <int,boolean> tuples. These are opaque values. In order for this to work, every tokenizer in the chain has to be aware of every other one's use of these. This makes sharing hard."

        To some extent, though, the same is true for the current type() functionality. One may decide to change the type, based on the value of the current type.

        While I agree the sharing is hard, it is not impossible, as one just needs to communicate which bits are available. I suppose I could see about adding an isClaimed(int position) method or something like that, whereby one can query the chain to see if anyone claims ownership of that position. I'll give that a try. However, to some extent, I also think it is buyer beware, in that TokenFilters further down the chain just need to be aware of what is going on. This is part of constructing an Analyzer that works.

        As for the naming, I suppose we could do Flags, as Yonik suggests.

        Grant Ingersoll added a comment -

        Never mind the isClaimed() idea; I don't see a good way for that to work.

        Yonik Seeley added a comment -

        If we go with the bitset (int or long!!!), "type" could be deprecated... there's no reason to have both.

        StandardTokenizer could define constants to replace

        public static final String[] TOKEN_TYPES = new String[] { "<ALPHANUM>", "<APOSTROPHE>", "<ACRONYM>", "<COMPANY>", "<EMAIL>", "<HOST>", "<NUM>", "<CJ>" };

        with StandardTokenizer.ALPHANUM, etc.
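As a rough sketch of that idea, each String type could become one bit in an int. The class below is hypothetical: the constant names mirror the TOKEN_TYPES entries, but the bit assignments and the isAny helper are assumptions, not what StandardTokenizer defines.

```java
/** Hypothetical int constants mirroring the String entries in TOKEN_TYPES. */
final class StandardFlagsSketch {
    // One bit per type; the bit assignments are assumptions for illustration.
    static final int ALPHANUM = 1 << 0;
    static final int APOSTROPHE = 1 << 1;
    static final int ACRONYM = 1 << 2;
    static final int COMPANY = 1 << 3;
    static final int EMAIL = 1 << 4;
    static final int HOST = 1 << 5;
    static final int NUM = 1 << 6;
    static final int CJ = 1 << 7;

    /** Returns true if flags has at least one of the bits in mask set. */
    static boolean isAny(int flags, int mask) {
        return (flags & mask) != 0;
    }

    public static void main(String[] args) {
        int tokenFlags = EMAIL;
        // One mask test replaces a chain of String comparisons on type().
        System.out.println(isAny(tokenFlags, EMAIL | HOST));
        System.out.println(isAny(tokenFlags, NUM | CJ));
    }
}
```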

        Grant Ingersoll added a comment -

        Per feedback from Yonik, changed this to use an int. The clear() method sets the flags value back to 0.

        Steve Rowe added a comment -

        Looks like the constructors still take a BitSet???

        My vote is for long instead of int, to maximize forward compatibility...

        Grant Ingersoll added a comment -

        Let's try a patch that actually compiles

        Grant Ingersoll added a comment -

        Committed on 614891


          People

          • Assignee: Grant Ingersoll
          • Reporter: Grant Ingersoll
          • Votes: 0
          • Watchers: 0
