Are stored fields now sparse? Meaning if I have a segment w/ many docs, and I update stored fields on one doc, in that tiny stacked segments will the stored fields files also be tiny?
In such a case you will get the equivalent of a segment with multiple docs, only one of which contains stored fields. I assume the stored fields implementations handle these cases well, and you will indeed get tiny stored fields.
You're right, this is up to the codec ... hmm, but the API isn't sparse (you have
to call .addDocument 1M times to "skip over" 1M docs, right?), and I'm not sure how well our
current default (Lucene41StoredFieldsFormat) handles it. Have you tested it?
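To make the "API isn't sparse" point concrete, here is a plain-Java model (not actual Lucene code; all names are illustrative): even when a stacked segment updates stored fields on a single doc, a non-sparse writer API forces one addDocument call per doc in the segment, most of them empty.

```java
import java.util.List;

public class SparseStoredFieldsSketch {
  /**
   * Models writing a stacked segment over a non-sparse stored-fields API.
   * Returns the number of addDocument calls required; only one doc
   * actually carries stored fields (collected into {@code written}).
   */
  public static int writeStackedSegment(int maxDoc, int updatedDoc, List<Integer> written) {
    int calls = 0;
    for (int docID = 0; docID < maxDoc; docID++) {
      if (docID == updatedDoc) {
        written.add(docID);  // the single doc with real stored fields
      }
      // Every other doc still costs an (empty) addDocument to "skip over" it.
      calls++;
    }
    return calls;
  }
}
```

Whether those 1M empty calls compress down to a tiny file is exactly the codec-dependent question raised above.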
Regarding the API: I did some cleanup and also removed Operation.ADD_DOCUMENT. Now there is only one way to perform each operation, and updateFields only lets you add or replace fields given a term.
This means you cannot reuse fields, you have to be careful with pre-tokenized fields (you can't reuse the TokenStream), etc.
This is noted in the Javadoc of updateFields; let me know if there's a better way to address it.
Maybe also state that one cannot reuse Field instances, since the
Field may not actually be "consumed" until some later time (we should
be vague since this really is an implementation detail).
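The hazard being described can be illustrated with a small plain-Java model (hypothetical names, not Lucene's API): if the consumer buffers a mutable Field-like object and only reads it lazily at flush time, a caller that reuses the instance silently corrupts earlier updates.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldReuseSketch {
  /** Stand-in for a mutable Field instance. */
  static class MutableField {
    String value;
    MutableField(String v) { value = v; }
  }

  /** Buffers fields and only "consumes" (reads) them at flush time. */
  static class LazyConsumer {
    private final List<MutableField> buffered = new ArrayList<>();
    void update(MutableField f) { buffered.add(f); }  // no defensive copy!
    List<String> flush() {
      List<String> out = new ArrayList<>();
      for (MutableField f : buffered) out.add(f.value);
      return out;
    }
  }

  /** Demonstrates the bug: both buffered entries end up reading "second". */
  public static List<String> reuseBug() {
    LazyConsumer c = new LazyConsumer();
    MutableField f = new MutableField("first");
    c.update(f);
    f.value = "second";  // caller reuses the instance before flush
    c.update(f);
    return c.flush();
  }
}
```

This is why the Javadoc warning matters even though the deferred consumption itself is an implementation detail.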
As for the heavier questions: NRT support should be considered separately, but the guideline I followed was to keep things as close as possible to the way deletions are handled. Therefore we need to add to SegmentReader a field named liveUpdates, the equivalent of liveDocs. I already put a TODO for this (SegmentReader line 131); implementing it won't be simple...
OK ... yeah it's not simple!
The performance tradeoff you are rightfully concerned about should be handled through merging: once you merge an updated segment, all updates are "cleaned" and the new segment has no performance issues. Apps that perform updates should make sure (through MergePolicy) to avoid reader-side updates as much as possible.
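A MergePolicy that follows this guideline might select segments by their update-to-doc ratio. The following is a hypothetical plain-Java sketch of that selection heuristic, not a real MergePolicy implementation (SegInfo, updateCount, etc. are made-up names):

```java
import java.util.ArrayList;
import java.util.List;

public class UpdateAwareMergeSketch {
  /** Minimal stand-in for per-segment bookkeeping. */
  public static class SegInfo {
    public final String name;
    public final int docCount;
    public final int updateCount;  // how many buffered/stacked updates it carries
    public SegInfo(String name, int docCount, int updateCount) {
      this.name = name;
      this.docCount = docCount;
      this.updateCount = updateCount;
    }
  }

  /**
   * Selects segments whose update ratio exceeds the threshold: merging
   * them "cleans" the updates and restores flat search performance.
   */
  public static List<String> selectForMerge(List<SegInfo> segs, double maxUpdateRatio) {
    List<String> toMerge = new ArrayList<>();
    for (SegInfo s : segs) {
      if (s.docCount > 0 && (double) s.updateCount / s.docCount > maxUpdateRatio) {
        toMerge.add(s.name);
      }
    }
    return toMerge;
  }
}
```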
Merging is very important. Hmm, are we able to just merge all updates
down to a single update, i.e. without merging the base segment? We
can't express that today from MergePolicy, right? In an NRT setting
this seems very important (it'd be the best bang (= improved search
performance) for the buck (= merge cost)).
I suspect we need to do something with merging before committing.
Hmm, I see that
StackedTerms.size()/getSumTotalTermFreq()/getSumDocFreq() pulls a
TermsEnum and goes and counts/aggregates all terms ... which in
general is horribly costly? E.g., these methods are called per-query to
set up the Sim for scoring ... I think we need another solution here
(not sure what). Also getDocCount() just returns -1 now ... maybe we
should only allow updates against DOCS_ONLY/omitNorms fields for now?
Have you done any performance tests on biggish indices?
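To show the cost concern in isolation, here is a plain-Java model (hypothetical, not StackedTerms itself): aggregating sumTotalTermFreq by walking every term is O(#terms) per call, which is why doing it per-query hurts; one alternative is to cache the total and invalidate it on change.

```java
import java.util.HashMap;
import java.util.Map;

public class TermStatsSketch {
  private final Map<String, Long> termFreqs = new HashMap<>();
  private long cachedSum = -1;  // -1 means "not computed yet" (freqs are non-negative)

  public void addTerm(String term, long totalTermFreq) {
    termFreqs.merge(term, totalTermFreq, Long::sum);
    cachedSum = -1;  // invalidate the cached aggregate
  }

  /**
   * O(#terms) on the first call after a change (like pulling a TermsEnum
   * and summing), O(1) afterwards thanks to the cache.
   */
  public long getSumTotalTermFreq() {
    if (cachedSum == -1) {
      long sum = 0;
      for (long ttf : termFreqs.values()) sum += ttf;
      cachedSum = sum;
    }
    return cachedSum;
  }
}
```

In a real codec the total would more likely be computed once at write time and persisted, so readers never pay the walk at all.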
I think we need a test that indexes a known (randomly generated) set
of documents, randomly sometimes using add and sometimes using
update/replace field, mixing in deletes (just like TestField.addDocuments()),
for the first index; for the second index, only use addDocument
on the "surviving" documents, and then assertIndexEquals(...) at the
end? Maybe we can factor out code from TestDuelingCodecs or
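The shape of such a duel test can be sketched with plain maps instead of real indices (everything here is illustrative; assertIndexEquals and the Lucene plumbing are assumed): apply the same random add/update/delete ops to a "stacked" view that resolves updates lazily and to an eager view that applies them immediately, then assert both agree.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Random;
import java.util.Set;

public class UpdateDuelSketch {
  /** "Stacked" view: base docs plus an update log and tombstones, resolved at read time. */
  static class Stacked {
    final Map<Integer, String> base = new HashMap<>();
    final Map<Integer, String> updates = new HashMap<>();
    final Set<Integer> deleted = new HashSet<>();
    void add(int id, String v)    { base.put(id, v); updates.remove(id); deleted.remove(id); }
    void update(int id, String v) { if (get(id) != null) updates.put(id, v); }
    void delete(int id)           { deleted.add(id); }
    String get(int id) {
      if (deleted.contains(id)) return null;
      String u = updates.get(id);
      return u != null ? u : base.get(id);
    }
  }

  /** Runs random ops against both views; returns true iff they end up equal. */
  public static boolean duel(long seed, int numOps) {
    Random r = new Random(seed);
    Stacked stacked = new Stacked();
    Map<Integer, String> eager = new HashMap<>();  // plain, immediately-applied index
    for (int op = 0; op < numOps; op++) {
      int id = r.nextInt(50);
      String v = "v" + op;
      switch (r.nextInt(3)) {
        case 0:  stacked.add(id, v);    eager.put(id, v); break;                 // add
        case 1:  if (eager.containsKey(id)) { stacked.update(id, v); eager.put(id, v); } break; // update
        default: stacked.delete(id);    eager.remove(id); break;                 // delete
      }
    }
    for (int id = 0; id < 50; id++) {                // stand-in for assertIndexEquals
      if (!Objects.equals(stacked.get(id), eager.get(id))) return false;
    }
    return true;
  }
}
```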
Where do we account for the RAM used by these buffered updates? I see
BufferedUpdates.addTerm has some accounting the first time it sees a
given term, but where do we actually add in the RAM used by the
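The accounting pattern in question might look like the following plain-Java sketch (hypothetical names and size estimates, not BufferedUpdates itself): a shared bytes-used counter that each buffered update bumps the first time a term is seen, so the writer knows when to flush. Note that replacing an existing entry is not re-accounted here, which mirrors the gap being asked about.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class BufferedUpdatesRamSketch {
  // Rough per-entry overhead estimate (map entry, object headers); made up for this sketch.
  private static final long BYTES_PER_ENTRY_OVERHEAD = 64;

  private final AtomicLong bytesUsed = new AtomicLong();
  private final Map<String, String> buffered = new HashMap<>();

  public void bufferUpdate(String term, String newValue) {
    if (buffered.put(term, newValue) == null) {
      // First time we see this term: account entry overhead plus payload
      // (~2 bytes per char). Replacements are NOT re-accounted -- the gap
      // being asked about above.
      bytesUsed.addAndGet(BYTES_PER_ENTRY_OVERHEAD
          + 2L * term.length() + 2L * newValue.length());
    }
  }

  public long ramBytesUsed() { return bytesUsed.get(); }

  /** The writer would flush buffered updates once this returns true. */
  public boolean shouldFlush(long ramBufferBytes) {
    return bytesUsed.get() >= ramBufferBytes;
  }
}
```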