I realize the current implementation that's attached here is quite
complicated, because it works on top of Lucene's APIs.
However, I really like its flexibility. Right now you can easily
rewrite certain parallel indexes without touching the others. I use it
in quite different ways. E.g., you can easily load one parallel index
into a RAMDirectory or onto an SSD and leave the other ones on the
conventional disk.
LUCENE-2025 only optimizes one use case of parallel indexing, where
you want to (re)write a parallel index containing only posting lists.
This will especially improve scenarios like the one Yonik pointed out
a while ago on java-dev, where you want to update only a few
documents rather than, e.g., a certain field for all documents.
In other use cases it is certainly desirable to have a parallel index
that contains a store. It really depends on what data you want to
I envision the version of parallel indexing that goes into Lucene's
core quite differently from the current patch here. That's why I'd
like to refactor the IndexWriter (LUCENE-2026) into a SegmentWriter
and, let's call it, an IndexManager (the component that controls
flushing, merging, etc.). You can then have a ParallelSegmentWriter,
which partitions the data into parallel segments, and the IndexManager
can behave the same way as before.
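
To make the proposed split a bit more concrete, here is a minimal,
self-contained sketch in plain Java. It does not use Lucene at all:
SegmentWriter, ParallelSegmentWriter and IndexManager are the names
from the paragraph above, but every signature, the in-memory
SimpleSegmentWriter, the field-to-column map and the maxBufferedDocs
flush rule are assumptions made up for illustration, not the
LUCENE-2026 design.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical interface: writes documents into exactly one segment.
interface SegmentWriter {
    void addDocument(Map<String, String> doc);
    int docCount();
}

// Hypothetical, simplest possible writer: buffers documents in memory.
class SimpleSegmentWriter implements SegmentWriter {
    final List<Map<String, String>> docs = new ArrayList<>();
    public void addDocument(Map<String, String> doc) { docs.add(doc); }
    public int docCount() { return docs.size(); }
}

// Partitions each document by field into parallel "columns".
// Every column receives one (possibly empty) document per call,
// so document IDs stay aligned across all parallel segments.
class ParallelSegmentWriter implements SegmentWriter {
    private final Map<String, Integer> fieldToColumn;
    private final List<SimpleSegmentWriter> columns = new ArrayList<>();

    ParallelSegmentWriter(Map<String, Integer> fieldToColumn, int numColumns) {
        this.fieldToColumn = fieldToColumn;
        for (int i = 0; i < numColumns; i++) {
            columns.add(new SimpleSegmentWriter());
        }
    }

    public void addDocument(Map<String, String> doc) {
        List<Map<String, String>> parts = new ArrayList<>();
        for (int i = 0; i < columns.size(); i++) {
            parts.add(new LinkedHashMap<>());
        }
        for (Map.Entry<String, String> e : doc.entrySet()) {
            // Unmapped fields go to column 0 (an arbitrary choice).
            int col = fieldToColumn.getOrDefault(e.getKey(), 0);
            parts.get(col).put(e.getKey(), e.getValue());
        }
        for (int i = 0; i < columns.size(); i++) {
            columns.get(i).addDocument(parts.get(i));
        }
    }

    public int docCount() { return columns.get(0).docCount(); }

    SimpleSegmentWriter column(int i) { return columns.get(i); }
}

// Controls flushing only; it sees SegmentWriter and nothing else, so it
// behaves the same whether segments are parallel or not.
class IndexManager {
    private final Supplier<SegmentWriter> writerFactory;
    private final int maxBufferedDocs;
    private SegmentWriter current;
    final List<SegmentWriter> flushed = new ArrayList<>();

    IndexManager(Supplier<SegmentWriter> writerFactory, int maxBufferedDocs) {
        this.writerFactory = writerFactory;
        this.maxBufferedDocs = maxBufferedDocs;
        this.current = writerFactory.get();
    }

    void addDocument(Map<String, String> doc) {
        current.addDocument(doc);
        if (current.docCount() >= maxBufferedDocs) {
            flush();
        }
    }

    void flush() {
        if (current.docCount() > 0) {
            flushed.add(current);
            current = writerFactory.get();
        }
    }
}
```

The point of the sketch is only that the flush logic never needs to
know how many parallel columns sit behind a SegmentWriter.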
You can keep thinking about the whole index as a collection of segments,
only now it will be a matrix of segments instead of a one-dimensional
array.
E.g. the norms could in the future be a parallel segment with a single
column-stride field that you can update by writing a new generation of
the parallel segment.
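
As a rough illustration of "updating by writing a new generation",
the following plain-Java sketch models one column-stride field (one
byte per document, as norms are) as an immutable generation. The class
name ColumnGeneration and its methods are hypothetical and are not part
of Lucene; the sketch only shows that a reader holding the old
generation is unaffected by the rewrite.

```java
// Hypothetical model of one generation of a single column-stride field,
// holding one byte per document (e.g. a norm).
final class ColumnGeneration {
    final long generation;
    private final byte[] values;

    ColumnGeneration(long generation, byte[] values) {
        this.generation = generation;
        this.values = values.clone(); // defensive copy keeps it immutable
    }

    byte get(int docId) { return values[docId]; }

    int maxDoc() { return values.length; }

    // "Updates" one value by writing a complete new generation.
    // This generation is left untouched, so open readers stay consistent.
    ColumnGeneration withValue(int docId, byte newValue) {
        byte[] copy = values.clone();
        copy[docId] = newValue;
        return new ColumnGeneration(generation + 1, copy);
    }
}
```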
Things like two-dimensional merge policies will nicely fit into this
model.
Different SegmentWriter implementations will allow you to write single
segments in different ways, e.g. doc-at-a-time (the default one with
addDocument()) or term-at-a-time (like addIndexes*() works).
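
The difference between the two write orders can be sketched in plain
Java. This is only a toy model of the access patterns, not how
addDocument() or addIndexes*() are implemented in Lucene: the class
Postings, both method names, and the representation of a segment as a
map from term to document IDs are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class Postings {
    // Doc-at-a-time: visit documents in order and append each doc ID
    // to the posting list of every distinct term in that document.
    static Map<String, List<Integer>> docAtATime(List<List<String>> docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : new TreeSet<>(docs.get(docId))) {
                postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
            }
        }
        return postings;
    }

    // Term-at-a-time: visit terms in order and, for each term, copy the
    // complete posting list from every source segment, shifting doc IDs
    // by the number of documents in the preceding sources.
    static Map<String, List<Integer>> termAtATime(
            List<Map<String, List<Integer>>> sources, List<Integer> docCounts) {
        int[] docBase = new int[sources.size()];
        for (int i = 1; i < sources.size(); i++) {
            docBase[i] = docBase[i - 1] + docCounts.get(i - 1);
        }
        TreeSet<String> allTerms = new TreeSet<>();
        for (Map<String, List<Integer>> source : sources) {
            allTerms.addAll(source.keySet());
        }
        Map<String, List<Integer>> merged = new TreeMap<>();
        for (String term : allTerms) {
            List<Integer> out = new ArrayList<>();
            for (int i = 0; i < sources.size(); i++) {
                List<Integer> list = sources.get(i).get(term);
                if (list != null) {
                    for (int docId : list) {
                        out.add(docId + docBase[i]);
                    }
                }
            }
            merged.put(term, out);
        }
        return merged;
    }
}
```

Both methods produce the same posting lists for the same input
documents; they differ only in the order in which the data is written.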
So I agree we can achieve updating posting lists the way you describe,
but it will be limited to posting lists then. If we allow (re)writing
segments in both dimensions I think we will create a more flexible
approach which is independent of what data structures we add to Lucene
- as long as they are not global to the index but per-segment as most
of Lucene's structures are today.
What do you think? Of course I don't want to over-complicate all this,
but if we can get LUCENE-2026 right, I think we can implement parallel
indexing in this segment-oriented way nicely.