These are hard questions. My personal goals for this prototype (currently SimpleText only!) were to:
1. Make merging use (significantly) less RAM, to fix this bug.
2. Make it easier to write docvalues codecs, to encourage innovation (e.g. FST impls, etc.)
3. Simplify the types to make it easier on the user.
I think the consumer API is simpler (part of #2), but I would like to (in the future) simplify the producer API too.
I'm not sure we should do it here, though. Anyway, we can think about the issues you raised one by one and handle them separately on their own issues.
fix other issues such as LUCENE-3862?
It's my opinion we should do this sooner rather than later.
merge the FieldCache / FunctionValues / DocValues.Source APIs?
This really needs to be addressed, but I think not here. It's horrific that algorithms like grouping, sorting, and maybe faceting have to be duplicated for two different things (fieldcache and docvalues).
are you going to remove DocValues.Type.FLOAT_*?
I think the 3 types we have here are enough. Someone can do a float or double type "on top of" the "number" type we have.
Lucene is already doing this today: look at norms. I think Lucene should just have a number type that stores bits.
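
A minimal sketch of what "on top of" could mean, assuming the number type stores a plain long per document. The class and method names here are hypothetical and are not part of the branch or of any Lucene API; only the java.lang.Float and java.lang.Double bit-conversion calls are real.

```java
public class FloatOnNumberSketch {

  // Hypothetical helpers (not Lucene APIs): store a float's raw bits in a long "number" value.
  static long encodeFloat(float value) {
    // The int widens to long with sign extension; decodeFloat narrows it back.
    return Float.floatToRawIntBits(value);
  }

  static float decodeFloat(long bits) {
    return Float.intBitsToFloat((int) bits);
  }

  // Same idea for doubles, which already fill all 64 bits.
  static long encodeDouble(double value) {
    return Double.doubleToRawLongBits(value);
  }

  static double decodeDouble(long bits) {
    return Double.longBitsToDouble(bits);
  }

  public static void main(String[] args) {
    long storedFloat = encodeFloat(1.5f);
    long storedDouble = encodeDouble(-2.25);
    // Both values round-trip through the long representation.
    System.out.println(decodeFloat(storedFloat));
    System.out.println(decodeDouble(storedDouble));
  }
}
```

Note that raw bits do not sort in numeric order for negative values, so anything that needs ordering would have to use a sortable encoding instead.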
are SimpleDVConsumer and SimpleDocValuesFormat going to replace PerDocConsumer and DocValuesFormat?
This is the idea: once we are happy with the APIs, we would implement the 4.0 ones with these APIs.
are you going to remove hasArray/getArray?
I don't care about this. I am unsure whether similarity impls should be calling this, though; at the very least
it would be better for them to fall back. I just can't bring myself to fix it until LUCENE-3862 is fixed.
will there still be a direct=true|false option at load-time or will it depend on the format impl (potentially with a PerFieldPerDocProducer similarly to the postings formats)?
I don't want to change this in the branch. Personally, I feel like a codec/SegmentReader/etc. should generally only manage
the direct impl, with the producer exposing the same "stats" (minimum, maximum, fixed, whatever) that the consumer APIs get (which will also make merging more efficient!). The default source impl can be something nice: read the direct impl into packed ints,
and so on. A codec could override this to, e.g., just slurp in its on-disk packed ints directly. So the codec still has control
of the in-memory RAM representation; I think this is important. But I think the codec and SegmentReader should somehow not
be in control of caching: this should be elsewhere (FieldCache.DOCVALUES.xxx????)...
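
To make the "default source impl" idea concrete, here is a rough sketch, assuming the producer hands over a direct per-document reader plus the minimum and maximum stats. Everything in it (NumericSource, loadPacked, the long[] layout) is hypothetical and hand-rolled for illustration; it is not the branch's API and does not use Lucene's real packed ints classes.

```java
public class PackedSourceSketch {

  /** Hypothetical read-only view of one number per document (not a Lucene API). */
  interface NumericSource {
    long get(int docID);
  }

  /** Bits needed for unsigned deltas in [0, maxDelta]; a negative maxDelta (overflow) yields 64. */
  static int bitsRequired(long maxDelta) {
    return Math.max(1, 64 - Long.numberOfLeadingZeros(maxDelta));
  }

  /** Default in-memory loader: reads the direct source once and bit-packs (value - min). */
  static NumericSource loadPacked(NumericSource direct, int maxDoc, long min, long max) {
    final int bits = bitsRequired(max - min);
    final long mask = bits == 64 ? -1L : (1L << bits) - 1;
    final long[] blocks = new long[(int) (((long) maxDoc * bits + 63) >>> 6)];
    for (int doc = 0; doc < maxDoc; doc++) {
      long delta = direct.get(doc) - min;
      long bitPos = (long) doc * bits;
      int block = (int) (bitPos >>> 6);
      int shift = (int) (bitPos & 63);
      blocks[block] |= delta << shift;
      if (shift + bits > 64) {
        // The value straddles two blocks: the high bits go into the next one.
        blocks[block + 1] |= delta >>> (64 - shift);
      }
    }
    return docID -> {
      long bitPos = (long) docID * bits;
      int block = (int) (bitPos >>> 6);
      int shift = (int) (bitPos & 63);
      long delta = blocks[block] >>> shift;
      if (shift + bits > 64) {
        delta |= blocks[block + 1] << (64 - shift);
      }
      return (delta & mask) + min;
    };
  }

  public static void main(String[] args) {
    long[] onDisk = {5, 1000, 7, 5, 999};  // stand-in for the codec's direct impl
    NumericSource direct = doc -> onDisk[doc];
    NumericSource inMemory = loadPacked(direct, onDisk.length, 5, 1000);
    for (int doc = 0; doc < onDisk.length; doc++) {
      System.out.println(inMemory.get(doc));
    }
  }
}
```

A codec that already stores packed ints on disk could skip this loop and load its blocks directly, which is the override case mentioned above. Caching of the returned source is deliberately left out of the sketch.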