|
Hi Mike,
this is just an idea to keep Token.java simpler, but I haven't really thought about all the consequences. So feel free to tell me that it's a bad idea Could you add a new class TermBuffer including the char[] array and your resize() logic that would implement CharSequence? Then you could get rid of the duplicate constructors and setters for String and char[], because String also implements CharSequence. And CharSequence has the method charAt(int index), so it should be almost as fast as directly accessing the char array in case the TermBuffer is used. You would need to change the existing constructors and setter to take a CharSequence object instead of a String, but this is not an API change as users can still pass in a String object. And then you would just need to add a new constructor with offset and length and a similiar setter. Thoughts? > [...] implement CharSequence
I think CharSequence is Java5 CharSequence was introduced in 1.4: http://java.sun.com/j2se/1.4.2/docs/api/java/lang/CharSequence.html
> Could you add a new class TermBuffer including the char[] array and
> your resize() logic that would implement CharSequence? Then you > could get rid of the duplicate constructors and setters for String > and char[], because String also implements CharSequence. And > CharSequence has the method charAt(int index), so it should be > almost as fast as directly accessing the char array in case the > TermBuffer is used. You would need to change the existing > constructors and setter to take a CharSequence object instead of a > String, but this is not an API change as users can still pass in a > String object. And then you would just need to add a new constructor > with offset and length and a similiar setter. Thoughts? If I understand this, consumers of the Token API would need to Also I don't think this would make the public API simpler? People Internally, Token looks more complex than it will be in the future, I believe having String and char[] variants of text-processing APIs is > This could then slow down applications that still need to make a
> new Token instance for every term in their documents because now > 2 class instances would be created for every token. Yes that's true. I was thinking that in the new optimized way, where > if we add our own new class here that's another Agree. I was just thinking that the CharSequence approach would reduce the OK, the API of this patch looks good to me! +1 Updated patch based on recent commits; fixed up the javadocs and a few
other small things. I think this is ready to commit but I'll wait a few days for more comments... |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
but otherwise I think this is close...