Good, I'll fix.
However, instead of continuing to add such queries to core, I think we should consider moving all our doc values queries to misc, since they have complicated trade-offs and are only useful in expert use-cases?
+1, I can move them here.
in certain cases (many terms/numbers and fewish matching hits) it should be faster than using TermsQuery
This comment got me confused: I think in general these queries are more efficient when they match many documents, i.e. even when an equivalent TermsQuery would not be used as a lead iterator in a conjunction? I think the only case where such a query matching few documents would be useful is in a prohibited clause, since prohibited clauses can never be used to lead iteration anyway and are only used in a random-access fashion?
Hmm, this is hard to think about, but yes, I was thinking about the use case where there is some other MUST'd clause as the primary, and this query is a MUST_NOT of a big list of numeric IDs.
The per-hit cost is higher with these DocValuesXXX queries (the forward lookup + check) vs visiting postings and ORing bitsets that TermsQuery does (when there are enough terms), but the setup cost is higher with TermsQuery since it must look up many terms across N segments, which is why I thought "not matching too many total hits" would favor a DocValuesXXX query with a large number of terms.
E.g. in the extreme case where you pass a single term to your TermsQuery or DocValuesTermsQuery, matching many docs, and it's the primary (only) clause in the query, TermsQuery should be much faster.
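To make the trade-off above a bit more concrete, here is a toy cost model in plain Java. It is not Lucene code: the class name, the method names and all three cost constants are made up for illustration, and real numbers would depend on the index and the hardware. It only encodes the two claims above: the postings approach pays a setup cost per term per segment, and the doc values approach pays a per-document cost for every document it is asked to check.

```java
// Toy cost model only. All names and constants are hypothetical, not measurements.
public final class TermsVsDocValuesCost {
    // Assumed relative costs, in arbitrary units.
    static final double TERM_LOOKUP_COST = 50.0;     // seek one term in one segment
    static final double POSTING_VISIT_COST = 1.0;    // visit one posting, set one bit
    static final double DOC_VALUES_CHECK_COST = 4.0; // forward lookup + membership check

    /** Postings approach: look up every term in every segment, then visit each matching posting. */
    public static double postingsCost(long numTerms, int numSegments, long matchingDocs) {
        return TERM_LOOKUP_COST * numTerms * numSegments
                + POSTING_VISIT_COST * matchingDocs;
    }

    /** Doc values approach: no per-term setup, one check per document it is asked about. */
    public static double docValuesCost(long docsChecked) {
        return DOC_VALUES_CHECK_COST * docsChecked;
    }

    public static void main(String[] args) {
        // Big "NOT in list" clause: 10,000 IDs, 20 segments, and another clause
        // leads iteration so only 1,000 documents are ever checked.
        System.out.println("postings:   " + postingsCost(10_000, 20, 10_000));
        System.out.println("doc values: " + docValuesCost(1_000));

        // Single term as the only clause: the doc values approach has nothing
        // to lead it, so it checks every one of 1,000,000 documents.
        System.out.println("postings:   " + postingsCost(1, 20, 500_000));
        System.out.println("doc values: " + docValuesCost(1_000_000));
    }
}
```

Under these assumed constants the doc values side wins the first scenario and the postings side wins the second, which matches the intuition in both cases. Where the crossover actually sits is exactly the part that is hard to reason about.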
It's OK in its current form to go to sandbox, but I think this needs to be integrated into the inverted approach so that, based on index stats, Lucene can just do the right thing.
OK, or I can just WONTFIX this ... I just thought there are use cases where this post-filter approach would be much faster than the choices we have today, e.g. when an app has numeric IDs and wants to make big "NOT in list" clauses.
I agree it would be better if we had only TermsQuery, and then it would figure out which strategy is best to take (use doc values, use numeric doc values if the IDs are really numeric, use postings) depending on index stats, whether the clause is primary or not, etc... but this seems very tricky: I can't even properly think about the cases, see Adrien's comment above.
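For reference, the "NOT in list" post-filter idea can be sketched in plain Java, without any Lucene classes. Everything here is a stand-in: `leadDocs` plays the role of the doc IDs produced by the primary MUST clause, `idByDoc` plays the role of a numeric doc values field (one ID per document, indexed by doc ID), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: plain arrays stand in for the lead iterator and doc values.
public final class NotInListPostFilter {

    /**
     * Keeps the lead documents whose numeric ID is NOT in the excluded set.
     * Cost is one forward lookup plus one hash check per lead document,
     * independent of how many IDs are in the excluded set.
     */
    public static List<Integer> filter(int[] leadDocs, long[] idByDoc, Set<Long> excludedIds) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : leadDocs) {
            long id = idByDoc[doc];          // forward lookup (doc -> value)
            if (!excludedIds.contains(id)) { // random-access membership check
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        long[] idByDoc = {10L, 11L, 12L, 13L};       // doc 0 has ID 10, doc 1 has ID 11, ...
        int[] leadDocs = {0, 2, 3};                  // docs matched by the primary clause
        Set<Long> excluded = new HashSet<>(Arrays.asList(12L));
        System.out.println(filter(leadDocs, idByDoc, excluded));
    }
}
```

Note that nothing is looked up per excluded ID per segment, which is why a very large exclusion list stays cheap as long as the primary clause does not match too many documents.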