Here's the first cut - seems to work fine.
You can try it out with facet.method=fcs (the extra "s" can either stand for the plural, since there are multiple field caches, or for segment).
I haven't introduced a way to limit the number of threads used... it's currently one per segment.
I'm thinking of a local param named "threads" for that.
Note: this will probably only make sense in NRT scenarios. In general it will take up more memory for the field caches, more memory per request for the accumulator arrays, and more CPU, since an additional merge step is needed. One possible side benefit is a reduction in field cache memory in cases where field cache insanity would otherwise occur (per-segment and whole-index field caches both being populated for the same field).
OK, so the idea is pretty simple: reuse the existing FieldCache-based algorithm for single-valued string fields.
Count per segment with a per-segment accumulator array, then merge all of the counts at the end (probably with a priority queue, the same method used in MultiTermEnum). This seems like a good opportunity to introduce some threading and do the per-segment counting in parallel.
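
To make the idea concrete, here is a rough, self-contained sketch in plain Java. It is not the patch and uses no Lucene/Solr classes: `Segment`, `Cursor`, `countSegment` and `facet` are hypothetical names, each segment's field cache entry is assumed to be a sorted term array plus a doc-to-ordinal array, and the `threads` argument only stands in for the thread limit discussed above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerSegmentFacetSketch {

    /** Hypothetical stand-in for one segment's field cache entry. */
    static final class Segment {
        final String[] terms;  // sorted, unique terms of this segment
        final int[] docToOrd;  // per document: index into terms, or -1 for no value
        Segment(String[] terms, int[] docToOrd) {
            this.terms = terms;
            this.docToOrd = docToOrd;
        }
    }

    /** Position within one segment's counted terms during the merge. */
    static final class Cursor {
        final String[] terms;
        final int[] counts;
        int pos = 0;
        Cursor(String[] terms, int[] counts) {
            this.terms = terms;
            this.counts = counts;
        }
        String term() { return terms[pos]; }
    }

    /** Per-segment accumulator array: one counter per term ordinal. */
    static int[] countSegment(Segment seg) {
        int[] counts = new int[seg.terms.length];
        for (int ord : seg.docToOrd) {
            if (ord >= 0) counts[ord]++;
        }
        return counts;
    }

    /** Counts each segment in parallel, then merges the counts in term order. */
    static Map<String, Integer> facet(List<Segment> segments, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, threads));
        try {
            List<Future<int[]>> futures = new ArrayList<>();
            for (Segment seg : segments) {
                Callable<int[]> task = () -> countSegment(seg);
                futures.add(pool.submit(task));
            }
            // The queue is ordered by each cursor's current term, so equal
            // terms from different segments are polled one after another.
            PriorityQueue<Cursor> queue =
                new PriorityQueue<>((a, b) -> a.term().compareTo(b.term()));
            for (int i = 0; i < segments.size(); i++) {
                Segment seg = segments.get(i);
                int[] counts = futures.get(i).get();
                if (seg.terms.length > 0) queue.add(new Cursor(seg.terms, counts));
            }
            Map<String, Integer> merged = new LinkedHashMap<>();
            while (!queue.isEmpty()) {
                Cursor c = queue.poll();
                int count = c.counts[c.pos];
                if (count > 0) merged.merge(c.term(), count, Integer::sum);
                if (++c.pos < c.terms.length) queue.add(c);
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `PerSegmentFacetSketch.facet(segments, 2)` returns the merged counts in sorted term order. The extra cost mentioned above shows up here as one `int[]` per segment per request plus the merge loop at the end.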