Here's a first cut; it seems to work fine.
You can try it out with facet.method=fcs (the extra "s" stands for either the plural, since there are multiple field caches, or "segment").
There's no way yet to limit the number of threads used; it's currently one thread per segment.
I'm thinking of adding a local param named "threads" for that.
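To make the threading model concrete, here is a minimal sketch of per-segment counting with a bounded thread pool. It is not the patch's code: the class and method names are hypothetical, each segment is modeled as a plain array of term ordinals for the documents that matched the query (-1 meaning no value), and `maxThreads` stands in for the proposed "threads" local param, which does not exist yet. Each segment gets its own accumulator array, which is where the extra per-request memory comes from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerSegmentFacetCounts {
    // Counts one segment. docOrds[i] is the ordinal (in this segment's own
    // term dictionary) of the value of the i-th matching doc, or -1 if none.
    static int[] countSegment(int[] docOrds, int numTerms) {
        int[] counts = new int[numTerms]; // per-request accumulator for this segment
        for (int ord : docOrds) {
            if (ord >= 0) counts[ord]++;
        }
        return counts;
    }

    // Counts all segments in parallel using at most maxThreads workers
    // (hypothetical stand-in for the proposed "threads" local param).
    static List<int[]> countAll(List<int[]> segmentOrds, List<Integer> segmentNumTerms, int maxThreads) {
        int poolSize = Math.max(1, Math.min(maxThreads, segmentOrds.size()));
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<int[]>> futures = new ArrayList<>();
            for (int s = 0; s < segmentOrds.size(); s++) {
                final int[] ords = segmentOrds.get(s);
                final int numTerms = segmentNumTerms.get(s);
                futures.add(pool.submit(() -> countSegment(ords, numTerms)));
            }
            List<int[]> result = new ArrayList<>();
            for (Future<int[]> f : futures) {
                result.add(f.get()); // results stay in segment order
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With `maxThreads` equal to the segment count this reproduces the current one-thread-per-segment behavior; a smaller value simply queues the remaining segments on the same workers.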
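Note: this will probably only make sense in NRT scenarios, where the index is reopened often and per-segment caches only need to be loaded for new segments instead of rebuilding a whole-index cache each time. Otherwise it costs more: more memory for the field caches, more memory per request for the accumulator arrays, and more CPU, since the per-segment counts have to be merged in an additional step. One possible side benefit is a reduction in total field cache memory, by avoiding "field cache insanity" (per-segment and whole-index field caches both being populated for the same field).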
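The extra merge step mentioned above exists because the same term generally has a different ordinal in each segment, so per-segment counts can only be combined by term value. The sketch below shows that idea with a `TreeMap` keyed on term text and a simple top-N selection; it is a simplified stand-in under that assumption, not necessarily how the patch merges (a streaming merge over the segments' sorted term lists would avoid materializing every term). The names `FacetMerge`, `merge`, and `top`, and the tie-break by term text, are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FacetMerge {
    // segTerms.get(s) is segment s's sorted term dictionary and
    // segCounts.get(s)[ord] is the count for segTerms.get(s)[ord].
    // Counts are summed by term text because ordinals differ per segment.
    static Map<String, Integer> merge(List<String[]> segTerms, List<int[]> segCounts) {
        Map<String, Integer> merged = new TreeMap<>(); // ordered by term text
        for (int s = 0; s < segTerms.size(); s++) {
            String[] terms = segTerms.get(s);
            int[] counts = segCounts.get(s);
            for (int ord = 0; ord < terms.length; ord++) {
                if (counts[ord] > 0) {
                    merged.merge(terms[ord], counts[ord], Integer::sum);
                }
            }
        }
        return merged;
    }

    // Returns up to `limit` terms, highest count first; ties broken by term text.
    static List<String> top(Map<String, Integer> merged, int limit) {
        List<String> terms = new ArrayList<>(merged.keySet());
        terms.sort((a, b) -> {
            int byCount = Integer.compare(merged.get(b), merged.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(limit, terms.size())));
    }
}
```

This is the CPU cost that a single whole-index field cache avoids: there, one ordinal space covers the whole index and the counts need no merging.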