attached is an updated patch, its still a work in progress (needs some more tests and benchmarking and some other things little fixes).
Theres a general pattern for these segmenters (this one, smartchinese, sen) thats a little tricky, that is they want to really look at sentences to determine how to segment.
So, I added a base class for this to make writing these segmenters easier, and also to hopefully improve segmentation accuracy. (I would like to switch smartchinese over to it) This class makes it easy to segment sentences with a Sentence BreakIterator... in my opinion it doesnt matter how theoretically good the word tokenization is for these things, if the sentence tokenizer is really bad (I found this issue with both sen and smartchinese).
hope to get it committable soon