Well, there are a couple of issues here. I've attached patches for trunk and 3x for consideration.
I fixed a structural flaw where all the terms in all the fields were traversed twice: once to get the total number of terms across all the fields, and a second time to get the individual counts.
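The single-pass idea, sketched in plain Java rather than against the actual handler code (the field/term pairs here are a hypothetical stand-in for a Lucene term enumeration): accumulate the per-field counts and the grand total in one traversal instead of two.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SinglePassCounts {
    // Hypothetical stand-in for a term enumeration: (field, term) pairs.
    static final String[][] TERMS = {
        {"id", "1"}, {"id", "2"}, {"title", "foo"}, {"title", "bar"}, {"title", "baz"}
    };

    public static void main(String[] args) {
        Map<String, Integer> perField = new LinkedHashMap<>();
        int total = 0;
        // One pass: bump the per-field count and the overall total together,
        // instead of one full traversal for the total and a second for the counts.
        for (String[] t : TERMS) {
            perField.merge(t[0], 1, Integer::sum);
            total++;
        }
        System.out.println(perField + " total=" + total);
    }
}
```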
But that's not where the bulk of the time gets spent. It turns out that getting the count of documents in which each term appears is the culprit. These two lines are executed for each field:
Query q = new TermRangeQuery(fieldName, null, null, false, false);
TopDocs top = searcher.search(q, 1);
and top.totalHits is reported. On an index with 99M documents (mostly integer data), returning the data takes 360 seconds when the above is executed and 150 seconds without it. Both versions traverse all the terms once, so these times would be greater without the patch due to the second traversal.
So the attached patches default to NOT doing the above and there's a new parameter reportDocCount that can be set to true to collect that information. What do people think? And is there a better way to get the count of documents in which the term appears? And do any alternate methods respect deleted docs like this one does?
I tried spinning through using TermDocs (3.6) but soon realized that the people who wrote TermRangeQuery probably got there first.
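For reference, the spin-through idea boils down to OR-ing the doc ids of every term's postings for the field into one bitset and taking the cardinality. This sketch simulates it with plain Java collections (the postings map is invented data, not the Lucene API); in 3.x the real thing would walk a TermEnum/TermDocs pair, and I believe TermDocs skips deleted docs, which is why the approach would respect deletions like TermRangeQuery does.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Map;

public class FieldDocCount {
    // Hypothetical postings for one field: term -> ids of docs containing it.
    static final Map<String, List<Integer>> POSTINGS = Map.of(
        "foo", List.of(0, 2, 5),
        "bar", List.of(2, 3),
        "baz", List.of(5, 7));

    public static void main(String[] args) {
        // Union the doc ids across all terms of the field; the cardinality
        // is the number of documents in which the field appears at all.
        BitSet seen = new BitSet();
        for (List<Integer> docs : POSTINGS.values())
            for (int d : docs) seen.set(d);
        System.out.println("docCount=" + seen.cardinality());
    }
}
```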
So I guess my real question is whether people object to the change in behavior, i.e. that users must explicitly request doc counts. This also means that the admin/schema browser doesn't report this by default, and I haven't made it optional from that interface. I'm not inclined to since that interface is going away, but if people feel strongly I might be persuaded. That info is available via admin/luke?fl=myfield&reportDocCount=true in a less painful fashion for a particular field anyway.
Along the way I alphabetized the fields without my other kludge of putting comparators in other classes. I'll kill that JIRA if this one goes forward.
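The alphabetizing can be done with natural ordering (a sketch of the general approach, not necessarily what the patch does): collect the per-field entries in a TreeMap so the keys come out sorted without comparator classes living elsewhere.

```java
import java.util.Map;
import java.util.TreeMap;

public class SortedFields {
    public static void main(String[] args) {
        // TreeMap keeps keys in natural (alphabetical) order on its own,
        // so no external comparator classes are needed.
        Map<String, Integer> fields = new TreeMap<>();
        fields.put("title", 3);
        fields.put("author", 7);
        fields.put("id", 99);
        System.out.println(fields);
    }
}
```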
Note that this still doesn't scale all that well, on my test index it's still a 5 minute wait. But then I guess that this kind of data gathering will take time by its nature.
If nobody objects, I'll commit this early next week after I've had a chance to put it down for a while, look at it with fresh eyes, and do some more testing. I think there are some inefficiencies in the single pass that I can wring out (about 30 seconds is spent just gathering the data in the single term-enumeration loop).