Maybe I'm missing something?
No, I don't think you are missing anything in that use case; it's just one example of its use. And I am not totally sold on this approach, but mostly am.
I had originally considered your option, but didn't feel it was satisfactory for cases where you are extracting things like proper nouns, or maybe generating a category value. The more general case is one where not all the tokens are needed (in fact, very few are). In those cases, you have to go back through the whole list of cached tokens in order to extract the ones you want. In fact, thinking some more on it, I am not sure my patch goes far enough: what if you want it to start buffering in mid-stream?
For example, if you had:
Proper Noun TF
and Proper Noun TF is solely responsible for setting aside proper nouns as it comes across them in the stream.
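To make the idea concrete, here is a rough sketch of a filter that sets tokens aside as they stream past, so nothing has to re-scan a full cache afterwards. It is plain Java over an Iterator<String>, not the Lucene TokenFilter API and not the code in the patch; the names ProperNounSketch, ProperNounFilter, setAside and analyze are made up for illustration, and "starts with an uppercase letter" is only a placeholder for real proper-noun detection.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a token filter; this is NOT the Lucene API.
class ProperNounSketch {

    /** Passes every token through unchanged and sets aside the ones that look like proper nouns. */
    static final class ProperNounFilter implements Iterator<String> {
        private final Iterator<String> input;
        private final List<String> setAside = new ArrayList<>();

        ProperNounFilter(Iterator<String> input) {
            this.input = input;
        }

        @Override
        public boolean hasNext() {
            return input.hasNext();
        }

        @Override
        public String next() {
            String token = input.next();
            // Naive assumption: a capitalized token is a proper noun.
            if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                setAside.add(token);
            }
            return token;
        }

        /** Tokens collected so far, in stream order. */
        List<String> setAside() {
            return setAside;
        }
    }

    /** Drains the first field's stream, then returns the tokens set aside for the second field. */
    static List<String> analyze(List<String> tokens) {
        ProperNounFilter filter = new ProperNounFilter(tokens.iterator());
        while (filter.hasNext()) {
            filter.next(); // the first field would index this token as usual
        }
        return filter.setAside();
    }
}
```

The point of the sketch is that only the few tokens of interest are held in memory, and the second field can consume them directly once the first field has been analyzed.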
As for the convoluted cross-field logic, I don't think it is all that convoluted. There are only two fields, and the implementing Analyzer takes care of all of it. The only real requirement on the application is that the fields be ordered correctly.
I do agree somewhat about the pre-analysis approach, except for the case where there may be a large number of tokens in the source field, in which case you are holding them all in memory (maxFieldLength mitigates this to some extent). Also, it puts the onus on the application writer to do it, when it could be pretty straightforward for Lucene to do it w/o its usual analysis pipeline.
At any rate, separate from the CollaboratingAnalyzer, I do think the CachedTokenFilter is useful, especially in supporting the pre-analysis approach.