|
[
Permlink
| « Hide
]
Grant Ingersoll added a comment - 16/Jan/08 04:47 PM
Added get/setTypeBits() method and underlying storage and constructors.
Gack! I recommended a bitset on Token previously, but I meant an elemental one... an int (32 bits) or a long (64 bits).
Half of the bits could be reserved for use by Lucene tokenizers, and half could be reserved for users. I think an actual BitSet is too heavy-weight. Just provide a int or long Token.getFlags() and int or long Token.setFlags(), and nothing more (we don't need to do bit twiddling for users IMO) I see two problems with this patch:
1. Although in the patch you say that the "type bits" field added by the patch is completely separate from the String type information, you don't name them with sufficiently different names to distinguish them. 2. The information encoded by BitSet is a set of <int,boolean> tuples. These are opaque values. In order for this to work, every tokenizer in the chain has to be aware of every other one's use of these. This makes sharing hard. At a minimum, there should be some way to declare who's using what bit for what purpose - maybe through a static hash table or something?
To some extent, though, the same is true for the current type() functionality. One may decide to change the type, based on the value of the current type. While I agree the sharing is hard, it is not impossible, as one need just make sure to communicate which bits are available. I suppose I could see about adding a isClaimed(int position) method or something like that, whereby one can query the chain to see if anyone claims ownership on that position. I'll give that a try. However, to some extent, I also think it is buyer beware in that TokenFilters further down the chain just need to be aware of what is going on. This is part of constructing an Analyzer that works. As for the naming, I suppose we could do Flags, as Yonik suggests. Never mind on the isClaimed() idea, I don't see a good way of how that would work.
If we go with the bitset (int or long!!!), "type" could be deprecated... there's no reason to have both.
StandardTokenizer could define constants to replace StandardTokenizer.ALPHANUM, etc Per feedback from Yonik, changes this to use an int. The clear() method sets the flags value back to 0.
Looks like the constructors still take a BitSet???
My vote is for long instead of int, to maximize forward compatibility... Let's try a patch that actually compiles
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||