> I don't know if your new PatternTokenizerFactory could replace either of these, though. For the first case, I still want the white space tokenization after I've stripped off all the junk I don't want. And for the second, I need to be able to do the remapping.
If you're really good with regular expressions, perhaps it could all be combined... I'm not.
In my real use case, I use the general PatternTokenizerFactory to split the input into a bunch of tokens, then I have a custom (ugly!) TokenFilter transform the stream with other one-off transformations similar to what you describe.
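
To make the shape of that pipeline concrete, here is a rough sketch in plain Java using only java.util.regex. It is not the Lucene/Solr API: the class name PipelineSketch, the split pattern, and the remapping table are all made up for illustration, and a real TokenFilter would work on a token stream rather than a List.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

public class PipelineSketch {
    // Hypothetical split pattern: break on runs of whitespace or commas.
    private static final Pattern SPLIT = Pattern.compile("[\\s,]+");

    // Hypothetical one-off remappings, standing in for the custom filter logic.
    private static final Map<String, String> REMAP =
            Map.of("colour", "color", "grey", "gray");

    // Stage 1: pattern-based tokenization (the role PatternTokenizerFactory plays).
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : SPLIT.split(input)) {
            // split() can yield an empty leading piece when input starts with a delimiter.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Stage 2: per-token transformation (the role the custom TokenFilter plays).
    static List<String> transform(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            out.add(REMAP.getOrDefault(lower, lower));
        }
        return out;
    }

    // Full pipeline: tokenize first, then remap each token.
    static List<String> run(String input) {
        return transform(tokenize(input));
    }

    public static void main(String[] args) {
        System.out.println(run("Grey,  colour swatch")); // prints [gray, color, swatch]
    }
}
```

Keeping the two stages separate is what lets the remapping stay a simple per-token lookup instead of being folded into one large regular expression.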