Affects Version/s: None
Fix Version/s: None
Currently, SegmentMerger supports two classes of citizens being merged:
- SegmentReader
- "foreign reader" (e.g. some FilterReader)
It does an instanceof check and executes the merge differently. In the SegmentReader case: stored fields and term vectors are bulk-merged, norms and docvalues are transferred directly without piling up on the heap, CRC32 verification runs with IO locality of the data being merged, etc. Otherwise, we treat it as a "foreign" reader and it's slow.
This is just the low level; it gets worse as you wrap with more stuff. A great example is SortingMergePolicy: not only will it have the low-level slowdowns listed above, it will e.g. cache/pile up OrdinalMaps for all string docvalues fields being merged, and other silliness that just makes matters worse.
Another use case is 5.0 users wishing to upgrade from fieldcache to docvalues. This should be possible to implement as a simple incremental transition based on a mergepolicy that uses UninvertingReader. But we shouldn't populate internal fieldcache entries unnecessarily on merge and spike RAM until all those segment cores are released, and other things like bulk merge of stored fields and not piling up norms should still work: they're completely unrelated.
There are more problems we can fix if we clean this up: checkindex/checkreader can run efficiently without the RAM spike that merging has today, we can remove the checkIntegrity() method completely from LeafReader since it can always be accomplished on producers, etc. In general it would be nice to just have one codepath for merging that is as efficient as we can make it, and to support things like index modifications during merge.
I spent a few weeks writing 3 different implementations to fix this (interface, optional abstract class, "fix LeafReader"), and the last is the only one I don't completely hate: I think our APIs should be efficient for indexing as well as search.
So the proposal is simple: refactor LeafReader to just require the producer APIs as abstract methods (and FilterReaders should work on that). The search-oriented APIs can just be final methods that defer to those.
So we would add 5 abstract methods, but implement 10 current methods as final based on those, and then merging would always be efficient.
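A rough sketch of the shape of this, to make it concrete. This is not actual Lucene code: every name here (ProducerApiSketch, ValuesProducer, SketchLeafReader, ArrayReader, DoublingFilterReader, merge) is made up for illustration, and a single toy "producer" stands in for the real ones. The point is only that the reader exposes the producer as its one abstract method, the search-facing method is final and defers to it, filter readers wrap at the producer level, and merging needs no instanceof check.

```java
import java.util.ArrayList;
import java.util.List;

public class ProducerApiSketch {

    /** Hypothetical stand-in for a codec-level producer (think norms or docvalues). */
    public interface ValuesProducer {
        int maxDoc();
        long get(int docId);
    }

    /** Toy reader: the producer accessor is the only abstract method. */
    public abstract static class SketchLeafReader {
        public abstract ValuesProducer getValuesProducer();

        // Search-oriented API: final, always implemented on top of the producer.
        public final long getValue(int docId) {
            return getValuesProducer().get(docId);
        }
    }

    /** Plays the role of a plain segment reader, backed by an in-memory array. */
    public static final class ArrayReader extends SketchLeafReader {
        private final long[] values;

        public ArrayReader(long... values) {
            this.values = values;
        }

        @Override
        public ValuesProducer getValuesProducer() {
            return new ValuesProducer() {
                @Override public int maxDoc() { return values.length; }
                @Override public long get(int docId) { return values[docId]; }
            };
        }
    }

    /** Plays the role of a FilterReader: it wraps the producer, not the search API. */
    public static final class DoublingFilterReader extends SketchLeafReader {
        private final SketchLeafReader in;

        public DoublingFilterReader(SketchLeafReader in) {
            this.in = in;
        }

        @Override
        public ValuesProducer getValuesProducer() {
            final ValuesProducer delegate = in.getValuesProducer();
            return new ValuesProducer() {
                @Override public int maxDoc() { return delegate.maxDoc(); }
                @Override public long get(int docId) { return 2 * delegate.get(docId); }
            };
        }
    }

    /** One merge codepath for every reader: it only ever talks to producers. */
    public static List<Long> merge(List<SketchLeafReader> readers) {
        List<Long> merged = new ArrayList<>();
        for (SketchLeafReader reader : readers) {
            ValuesProducer producer = reader.getValuesProducer();
            for (int doc = 0; doc < producer.maxDoc(); doc++) {
                merged.add(producer.get(doc));
            }
        }
        return merged;
    }
}
```

In this toy version a wrapped reader goes through exactly the same merge loop as an unwrapped one, which is the property we want from the real refactoring.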