Patch adds some nocommits and tests that expose some problems:
If you run the test with -Dtests.method=testSegmentMerges -Dtests.seed=7651E2AEEBC55BDF, you'll hit an exception:
NOTE: reproduce with: ant test -Dtestcase=TestNumericDocValuesUpdates -Dtests.method=testSegmentMerges -Dtests.seed=7651E2AEEBC55BDF -Dtests.locale=en_AU -Dtests.timezone=Etc/GMT+11 -Dtests.file.encoding=UTF-8
Aug 29, 2013 11:57:35 AM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
WARNING: Will linger awaiting termination of 1 leaked thread(s).
Aug 29, 2013 11:57:35 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[Lucene Merge Thread #0,6,TGRP-TestNumericDocValuesUpdates]
org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError: formatName=Lucene45 prevValue=Memory
Caused by: java.lang.AssertionError: formatName=Lucene45 prevValue=Memory
What happens is that the test uses RandomCodec, which picks MemoryDVF for writing that field. Later, when ReaderAndLiveDocs applies updates to that field, it uses SI.codec, which is no longer RandomCodec but Lucene45Codec (in this case Facet45Codec, based on Codec.forName("Lucene45")), and its DVF returns Lucene45DVF for that field, because Lucene45Codec always returns that. During search this works because PerFieldDVF.FieldsReader does not rely on the Codec at all: it looks up an attribute in FieldInfo which records the DVFormat.name and then calls DVF.forName. For writing, however, it relies on the Codec.
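The mismatch between the two dispatch paths can be sketched as follows. This is a toy model with invented names, not Lucene's real classes: the read path resolves the format from a per-field name recorded at write time (as PerFieldDVF.FieldsReader does via a FieldInfo attribute and DVF.forName), while the write path asks the current Codec instance, which answers without consulting that attribute.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two dispatch paths; names are illustrative only.
public class FormatDispatchSketch {
    // stand-in for the per-field attributes recorded in FieldInfo at flush time
    static Map<String, String> fieldAttributes = new HashMap<>();

    // read path: like PerFieldDVF.FieldsReader, resolve by the recorded name
    // (in Lucene this name would then be passed to DVF.forName)
    static String formatForRead(String field) {
        return fieldAttributes.get(field);
    }

    // write path: like Lucene45Codec, the codec instance decides, ignoring
    // whatever format name was recorded when the segment was first written
    static String formatForWrite(String field) {
        return "Lucene45"; // Lucene45Codec always returns its own DVF
    }

    public static void main(String[] args) {
        // RandomCodec picked MemoryDVF when the field was originally written
        fieldAttributes.put("f", "Memory");

        String prev = formatForRead("f");   // "Memory"
        String next = formatForWrite("f");  // "Lucene45"
        // the disagreement behind: AssertionError: formatName=Lucene45 prevValue=Memory
        System.out.println("prevValue=" + prev + " formatName=" + next);
    }
}
```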
I am not sure how to resolve this. I don't think ReaderAndLiveDocs is doing anything wrong – per-field is not exposed in the Codec API, so it shouldn't assume it needs to do any per-field work. But on the other hand, Lucene45Codec instances return a per-field DVF based on what the instance says, and don't look at the FieldInfo attributes as PerFieldDVF.FieldsReader does. Any ideas?
Robert thought of this use case: if you have a sparse DocValues field 'f' such that, say, in segment 1 only doc1 has a value but in segment 2 no document has a value, then you cannot really update documents in segment 2, because the FieldInfos for that segment won't list the field as having DocValues at all. For now, I catch that case in ReaderAndLiveDocs and throw an exception. The workaround is to make sure a field always has a value in every segment, e.g. by always setting some default value. But this is ugly and exposes internal stuff (e.g. segments) to users. It is also bad because, e.g., once segments 1+2 are merged, you can suddenly update documents that were previously in segment 2.
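The sparse-field trap can be modeled with a small sketch (all names hypothetical, not Lucene's API): per-segment FieldInfos only records fields that actually had values in that segment, so the update check fails for segment 2 but passes again once the segments are merged.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the sparse DocValues scenario; names are illustrative only.
public class SparseFieldSketch {
    // fields recorded as having DocValues, per segment, at flush time
    static Set<String> segment1Fields = new HashSet<>();
    static Set<String> segment2Fields = new HashSet<>();

    // mirrors the check described above: no DV entry in FieldInfos, no update
    static boolean canUpdate(Set<String> segmentFieldInfos, String field) {
        return segmentFieldInfos.contains(field);
    }

    public static void main(String[] args) {
        segment1Fields.add("f"); // in segment 1, doc1 has a value for 'f'
        // in segment 2, no document has a value for 'f'

        System.out.println(canUpdate(segment1Fields, "f")); // true
        System.out.println(canUpdate(segment2Fields, "f")); // false -> exception

        // after merging segments 1+2, the merged FieldInfos lists 'f',
        // so documents that used to be in segment 2 suddenly become updatable
        Set<String> merged = new HashSet<>(segment1Fields);
        merged.addAll(segment2Fields);
        System.out.println(canUpdate(merged, "f")); // true
    }
}
```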
A way to solve it is to gen FieldInfos as well. That would additionally let us support adding new fields through field updates, though that part is optional and we can still choose to forbid it. If we gen FieldInfos, though, the changes I've made to SegmentInfos (recording per-field dvGen) need to be reverted, so it's important that we come to a resolution about this in this issue. This is somewhat of a corner case (sparse fields), but I don't like the fact that users can trip on exceptions depending on whether or not the segment was merged...
FieldInfos.Builder neglects to update the globalFieldNumbers.docValuesType map when it updates a FieldInfo's DocValuesType. It's an easy fix, and I added a test to numeric updates. If someone has an idea how to reproduce this outside the numeric updates scope, I'll be happy to handle it in a separate issue. The problem is that if you add two fields with the same name to a document, one as a posting and one as DV, globalFieldNumbers does not record that field name in its docValuesType map. This map, however, is currently used by IW.updateNumericDVField and by an assert in FieldInfos.Builder, therefore I wasn't able to reproduce it outside the numeric updates scope.
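The bug can be illustrated with a schematic reconstruction (names are illustrative, not the exact Lucene code): when an existing FieldInfo is upgraded with a DocValuesType, only the local entry is updated and the global docValuesType map is never told.

```java
import java.util.HashMap;
import java.util.Map;

// Schematic model of the FieldInfos.Builder bug; names are illustrative only.
public class GlobalFieldNumbersSketch {
    // global, cross-segment map: field name -> doc values type
    static Map<String, String> globalDocValuesType = new HashMap<>();
    // builder-local field infos: field name -> doc values type (or null)
    static Map<String, String> fieldInfos = new HashMap<>();

    static void addOrUpdate(String name, String dvType) {
        if (!fieldInfos.containsKey(name)) {
            // first sighting of the field: both maps are updated correctly
            fieldInfos.put(name, dvType);
            globalDocValuesType.put(name, dvType);
        } else if (dvType != null) {
            // existing field gains a DV type: the buggy path updates only the
            // local FieldInfo and forgets globalDocValuesType
            fieldInfos.put(name, dvType);
            // the fix would be: globalDocValuesType.put(name, dvType);
        }
    }

    public static void main(String[] args) {
        addOrUpdate("f", null);      // 'f' first added as a posting
        addOrUpdate("f", "NUMERIC"); // then added again as a DV field

        System.out.println(fieldInfos.get("f"));          // NUMERIC
        System.out.println(globalDocValuesType.get("f")); // null -- the bug
    }
}
```

Because only IW.updateNumericDVField and an assert in FieldInfos.Builder consult this map, the stale entry surfaces only through numeric updates, which matches the difficulty of reproducing it elsewhere.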