If I merge two consecutive segments, how come I don't merge their doc stores?
Multiple segments are able to share a single set of doc-store (=
stored fields & term vectors) files, today. This only happens when
multiple segments are written in a single IndexWriter session with
autoCommit false.
EG if I open a writer, index all of Wikipedia w/ autoCommit false, and
close it, you'll see a single large set of doc store files (eg _0.fdt,
_0.fdx, _0.tvf, _0.tvd, _0.tvx).
Whenever RAM is full (with postings & norms data), a new segment is
flushed, but the doc store files are kept open & shared with further
segments.
A single segment then refers to the shared doc stores, but records its
"offset" within them.
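
A toy model of that bookkeeping, as a sketch only: the names SegmentInfo,
doc_store_name, doc_store_offset and flush_segments are hypothetical and
are not Lucene's actual classes or fields. It just shows several flushed
segments pointing into one shared doc store, each recording where its own
docs start.

```python
from dataclasses import dataclass


@dataclass
class SegmentInfo:
    # Hypothetical stand-in for a segment's metadata, not Lucene's real class.
    name: str
    doc_count: int
    doc_store_name: str    # which shared doc store files this segment uses
    doc_store_offset: int  # index of this segment's first doc in that store


def flush_segments(doc_counts, doc_store_name="_0"):
    """Simulate one writer session: each flush creates a new segment,
    but every segment keeps pointing at the same doc store."""
    segments = []
    offset = 0
    for i, count in enumerate(doc_counts):
        segments.append(SegmentInfo(f"_{i}", count, doc_store_name, offset))
        offset += count  # the next segment's docs start right after this one's
    return segments


segments = flush_segments([1000, 800, 1200])
for seg in segments:
    print(seg.name, seg.doc_store_name, seg.doc_store_offset)
```

Each segment has its own name, but all three share doc store "_0" and
differ only in the offset they record.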
So, when we merge contiguous segments, because the resulting docs are
also contiguous in the doc stores, we are able to store a single doc
store offset in the merged segment, referencing the original doc
store, and it works fine.
But if we merge non-contiguous segments, we must then pull out & merge
the "slices" from the doc stores into a new [private to the new
segment] set of doc store files.
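
The contiguous vs. non-contiguous decision can be sketched roughly as
follows. This is an illustration under simplifying assumptions, not
Lucene's merge code: Seg, can_share_doc_store and merge are hypothetical
names, and the sketch ignores deleted documents entirely.

```python
from dataclasses import dataclass


@dataclass
class Seg:
    # Hypothetical segment record, not Lucene's real class.
    doc_count: int
    doc_store_name: str
    doc_store_offset: int


def can_share_doc_store(segs):
    """True if the segments cover one unbroken run of a single doc store."""
    expected = segs[0].doc_store_offset
    for s in segs:
        if s.doc_store_name != segs[0].doc_store_name:
            return False  # segments live in different doc stores
        if s.doc_store_offset != expected:
            return False  # gap in the doc store: not contiguous
        expected += s.doc_count
    return True


def merge(segs, new_store_name):
    total = sum(s.doc_count for s in segs)
    if can_share_doc_store(segs):
        # Contiguous: keep referencing the original doc store, one offset.
        return Seg(total, segs[0].doc_store_name, segs[0].doc_store_offset)
    # Non-contiguous: the slices would be copied into a new doc store
    # that is private to the merged segment, so its offset starts at 0.
    return Seg(total, new_store_name, 0)


a = Seg(10, "_0", 0)
b = Seg(5, "_0", 10)
c = Seg(7, "_0", 15)
print(merge([a, b], "_3"))  # contiguous: still points into "_0"
print(merge([a, c], "_3"))  # skips b: needs its own doc store "_3"
```

Only the second merge would pay the cost of copying stored fields and
term vectors.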
For apps that store term vectors w/ positions & offsets, and have many
stored fields, and have heterogeneous field name -> number assignments
(see LUCENE-1737 to fix that), the merging of doc stores can easily
dominate the merge cost.