Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index, core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      since IR is now fully R/O and norms are inside codecs we can cut over to use a IDV impl for writing norms. LUCENE-3606 has some ideas about how this could be implemented

      1. LUCENE-3628.patch
        139 kB
        Simon Willnauer
      2. LUCENE-3628.patch
        141 kB
        Simon Willnauer
      3. LUCENE-3628.patch
        139 kB
        Simon Willnauer
      4. LUCENE-3628.patch
        142 kB
        Simon Willnauer
      5. LUCENE-3628.patch
        142 kB
        Simon Willnauer
      6. LUCENE-3628.patch
        166 kB
        Simon Willnauer
      7. LUCENE-3628.patch
        176 kB
        Simon Willnauer

        Activity

        Hide
        Simon Willnauer added a comment -

        Here is a first patch to cut over norms to DocValues. IndexReader now exposes a DocValues normValues(String) method in addition to the still present byte[] norms(String) method. On the codec level I moved all norm value buffering into the codec and remove the NormsWriter/Reader interface entirely in favor of PerDocProducer/Consumer.

        Lucene3x codec has now a private norms impl that reads and writes the old format exposed as DocValues. It basically works as it used to work except of the buffering which is now codec private.

        Lucene4 & Sep codec now uses DocValues directly currently still bound to a single byte norm value. I didn't want to include changes to the similarity since they are kind of orthogonal to this issue and the patch is big enough already.

        SimpleText has its own simple norms impl also exposed via the DocValues API. To make this entire thing simpler I moved around some code in the values package to make merging work out of the box without writing any merge code if you don't want to do bulk copies etc.

        this is quite a big change... review welcome!

        Show
        Simon Willnauer added a comment - Here is a first patch to cut over norms to DocValues. IndexReader now exposes a DocValues normValues(String) method in addition to the still present byte[] norms(String) method. On the codec level I moved all norm value buffering into the codec and remove the NormsWriter/Reader interface entirely in favor of PerDocProducer/Consumer. Lucene3x codec has now a private norms impl that reads and writes the old format exposed as DocValues. It basically works as it used to work except of the buffering which is now codec private. Lucene4 & Sep codec now uses DocValues directly currently still bound to a single byte norm value. I didn't want to include changes to the similarity since they are kind of orthogonal to this issue and the patch is big enough already. SimpleText has its own simple norms impl also exposed via the DocValues API. To make this entire thing simpler I moved around some code in the values package to make merging work out of the box without writing any merge code if you don't want to do bulk copies etc. this is quite a big change... review welcome!
        Hide
        Robert Muir added a comment -

        thanks for working on this!

        at a glance I really like it. I totally agree that the sim changes should be deferred to a separate issue: in that issue I think we should also nuke the byte[] norms(String) method so we only have the dv api (maybe we even do that first on the followup issue, before the more interesting changes).

        I didnt look too much at the codec api or the default merging or whatever, but we really need this. If simpletext in this patch can have a norms-only DV api and merge it etc, then it must not be too far away from having a full DV impl.

        Show
        Robert Muir added a comment - thanks for working on this! at a glance I really like it. I totally agree that the sim changes should be deferred to a separate issue: in that issue I think we should also nuke the byte[] norms(String) method so we only have the dv api (maybe we even do that first on the followup issue, before the more interesting changes). I didnt look too much at the codec api or the default merging or whatever, but we really need this. If simpletext in this patch can have a norms-only DV api and merge it etc, then it must not be too far away from having a full DV impl.
        Hide
        Uwe Schindler added a comment -

        Hi I did not yet understand the patch completely, only two things recognized without close review:

        • Lucene40Codec now imports 3x NormsFormat but it should not use it, I think thats obsolete but maybe there is some backwards compatibility invisible to me
        • byte[] SegmentReader.norms() has less null checks than the new DocValues normsValues() method. Maybe the "old" norms() using byte should simply delegate to the new method, and return the inner byte[]:
        public byte[] norms(String field) throws IOException {
          // ensureOpen() is called by normValues():
          final DocValues docValues = this.normValues(field);
          if (docValues != null) {
            Source source = docValues.getSource();
            assert source.hasArray(); // TODO cut over to source
            return (byte[])source.getArray();  
          }
          return null;
        }
        

        Maybe this method impl should be in the top-level IndexReader class as final method that delegates to the abstract normValues?

        Otherwise initial patch looks fine!

        Show
        Uwe Schindler added a comment - Hi I did not yet understand the patch completely, only two things recognized without close review: Lucene40Codec now imports 3x NormsFormat but it should not use it, I think thats obsolete but maybe there is some backwards compatibility invisible to me byte[] SegmentReader.norms() has less null checks than the new DocValues normsValues() method. Maybe the "old" norms() using byte should simply delegate to the new method, and return the inner byte[]: public byte [] norms( String field) throws IOException { // ensureOpen() is called by normValues(): final DocValues docValues = this .normValues(field); if (docValues != null ) { Source source = docValues.getSource(); assert source.hasArray(); // TODO cut over to source return ( byte [])source.getArray(); } return null ; } Maybe this method impl should be in the top-level IndexReader class as final method that delegates to the abstract normValues? Otherwise initial patch looks fine!
        Hide
        Uwe Schindler added a comment -

        BTW: I like IOUtils.deleteFiles g, should maybe renamed IOUtils.deleteFilesIgnoringExceptions()

        Show
        Uwe Schindler added a comment - BTW: I like IOUtils.deleteFiles g , should maybe renamed IOUtils.deleteFilesIgnoringExceptions()
        Hide
        Michael McCandless added a comment -

        Simon are you using svn 1.7...? If so, can you regen the patch with "svn diff --show-copies-as-adds"? Thanks!

        Show
        Michael McCandless added a comment - Simon are you using svn 1.7...? If so, can you regen the patch with "svn diff --show-copies-as-adds"? Thanks!
        Hide
        Simon Willnauer added a comment -

        here is another patch removing unused imports of Lucene3x codecs internal in Lucene4Codec & renamed IOUtils#deleteFiles into deleteFilesIgnoringExceptions(...)

        Show
        Simon Willnauer added a comment - here is another patch removing unused imports of Lucene3x codecs internal in Lucene4Codec & renamed IOUtils#deleteFiles into deleteFilesIgnoringExceptions(...)
        Hide
        Simon Willnauer added a comment -

        new patch with "svn diff --show-copies-as-adds"

        Show
        Simon Willnauer added a comment - new patch with "svn diff --show-copies-as-adds"
        Hide
        Uwe Schindler added a comment - - edited

        Hi,
        much more readable, thanks!

        Just one question: The Lucene3xNormsConsumer could be theoretically moved to the PreflexRWCodec in test-framework, or am I missing something?

        Means, its impl in Lucene3xNormsFormat should look like this from Lucene3xPostingsFormat:

        @Override
        public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
          throw new IllegalArgumentException("this codec can only be used for reading");
        }
        
        Show
        Uwe Schindler added a comment - - edited Hi, much more readable, thanks! Just one question: The Lucene3xNormsConsumer could be theoretically moved to the PreflexRWCodec in test-framework, or am I missing something? Means, its impl in Lucene3xNormsFormat should look like this from Lucene3xPostingsFormat: @Override public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException { throw new IllegalArgumentException( " this codec can only be used for reading" ); }
        Hide
        Simon Willnauer added a comment -

        Just one question: The Lucene3xNormsConsumer could be theoretically moved to the PreflexRWCodec in test-framework, or am I missing something?

        yeah good call... I will move this impl to preflex and throw an exception in the 3x impl.

        Show
        Simon Willnauer added a comment - Just one question: The Lucene3xNormsConsumer could be theoretically moved to the PreflexRWCodec in test-framework, or am I missing something? yeah good call... I will move this impl to preflex and throw an exception in the 3x impl.
        Hide
        Simon Willnauer added a comment -

        next iteration moving the Lucene3xNormsConsumer into Preflex and throwing IAE if Lucene3x codec is used for writing. I also added some more javadoc comments and a missing licence header.

        Show
        Simon Willnauer added a comment - next iteration moving the Lucene3xNormsConsumer into Preflex and throwing IAE if Lucene3x codec is used for writing. I also added some more javadoc comments and a missing licence header.
        Hide
        Simon Willnauer added a comment -

        another patch delegating to normValues in IR#norms(field) as uwe suggested.

        Maybe this method impl should be in the top-level IndexReader class as final method that delegates to the abstract normValues?

        I looked into this but subclasses override this and to tricky things like ParallelReader uses MultiNorms internally etc. I think since this method will go away anyway and consumers should call normValues(field) instead and then decide how they use it instead of forcing an array we should just leave this as it is for now. I can certainly move it up to IR but making it final is going to be tricky and likely not worth the trouble.

        Show
        Simon Willnauer added a comment - another patch delegating to normValues in IR#norms(field) as uwe suggested. Maybe this method impl should be in the top-level IndexReader class as final method that delegates to the abstract normValues? I looked into this but subclasses override this and to tricky things like ParallelReader uses MultiNorms internally etc. I think since this method will go away anyway and consumers should call normValues(field) instead and then decide how they use it instead of forcing an array we should just leave this as it is for now. I can certainly move it up to IR but making it final is going to be tricky and likely not worth the trouble.
        Hide
        Uwe Schindler added a comment -

        I looked into this but subclasses override this and to tricky things like ParallelReader uses MultiNorms internally etc

        This method could theoretically go (and must as final in IR). ParallelReader and SlowMultiReader will also delegate to normValues(), which internally uses MultiDocValues. This should behave identical and MultiNorms would be useless at all and could be deleted (that was the idea). BUT:

        I think the problem is that MultiDocValues no longer has an array, so hasArray() returns false? If that's the case, I agree that we should leave it as it is for now and simply wait until norms() gets obsolete in a later issue.

        Patch looks good!

        Show
        Uwe Schindler added a comment - I looked into this but subclasses override this and to tricky things like ParallelReader uses MultiNorms internally etc This method could theoretically go (and must as final in IR). ParallelReader and SlowMultiReader will also delegate to normValues(), which internally uses MultiDocValues. This should behave identical and MultiNorms would be useless at all and could be deleted (that was the idea). BUT: I think the problem is that MultiDocValues no longer has an array, so hasArray() returns false? If that's the case, I agree that we should leave it as it is for now and simply wait until norms() gets obsolete in a later issue. Patch looks good!
        Hide
        Simon Willnauer added a comment -

        I think the problem is that MultiDocValues no longer has an array, so hasArray() returns false? If that's the case, I agree that we should leave it as it is for now and simply wait until norms() gets obsolete in a later issue.

        this is the case! Yet, this patch implements getArray / hasArray in MultiDocValues which allow to move the norms(String) method up to IR (its final now). I moved MultiNorms to the test-framework since some tests still need that. We can remove it once norms() is gone. I also added tests for get/hasArray in TestDocValuesIndexing. All test pass, I think this is ready; I will add a changes entry.

        Show
        Simon Willnauer added a comment - I think the problem is that MultiDocValues no longer has an array, so hasArray() returns false? If that's the case, I agree that we should leave it as it is for now and simply wait until norms() gets obsolete in a later issue. this is the case! Yet, this patch implements getArray / hasArray in MultiDocValues which allow to move the norms(String) method up to IR (its final now). I moved MultiNorms to the test-framework since some tests still need that. We can remove it once norms() is gone. I also added tests for get/hasArray in TestDocValuesIndexing. All test pass, I think this is ready; I will add a changes entry.
        Hide
        Uwe Schindler added a comment -

        I like the new code in MultiDocValues. The use of reflect.Array and getComponentType/newInstance is fine here, as its only executed when you actually request the merged array, which is expensive by default. If you access DocValues through the standard API, not direct via array, all is as usual. Merging the arrays is slow by default - so who cares? (<- this note is for Robert Muir).

        I think we should fix the DirectoryReader test to not use MultiNorms, maybe its a specific test only for MultiNorms, so obsolete? I will look into it!

        Additional note from discussion with Simon Willnauer on IRC: Maybe we should have the same DV cache in SlowMultiReaderWrapper for non-norm DocValues, too.

        Show
        Uwe Schindler added a comment - I like the new code in MultiDocValues. The use of reflect.Array and getComponentType/newInstance is fine here, as its only executed when you actually request the merged array, which is expensive by default. If you access DocValues through the standard API, not direct via array, all is as usual. Merging the arrays is slow by default - so who cares? (<- this note is for Robert Muir ). I think we should fix the DirectoryReader test to not use MultiNorms, maybe its a specific test only for MultiNorms, so obsolete? I will look into it! Additional note from discussion with Simon Willnauer on IRC: Maybe we should have the same DV cache in SlowMultiReaderWrapper for non-norm DocValues, too.
        Hide
        Uwe Schindler added a comment -

        I think we should fix the DirectoryReader test to not use MultiNorms, maybe its a specific test only for MultiNorms, so obsolete? I will look into it!

        There are more tests, but this can be easily fixed later and MultiNorms nuked (extract from TestDuellingCodecs):

        byte[] leftNorms = MultiNorms.norms(leftReader, field);
        

        replace by:

        byte[] leftNorms = new SlowMultiReaderWrapper(leftReader).norms(field);
        

        The current patch is in my opinion fine to commit, +1. Nice improvement!

        Show
        Uwe Schindler added a comment - I think we should fix the DirectoryReader test to not use MultiNorms, maybe its a specific test only for MultiNorms, so obsolete? I will look into it! There are more tests, but this can be easily fixed later and MultiNorms nuked (extract from TestDuellingCodecs): byte [] leftNorms = MultiNorms.norms(leftReader, field); replace by: byte [] leftNorms = new SlowMultiReaderWrapper(leftReader).norms(field); The current patch is in my opinion fine to commit, +1. Nice improvement!
        Hide
        Michael McCandless added a comment -

        Patch looks great! What an awesome step forward... once we later fix
        sim to be able to do whatever it wants at indexing time (eg, use a
        4-byte float), then apps are free to create arbitrary "norms" per doc!
        Wonderful...

        It's nice we now have a default merge base impl for DocValues, which
        the 4.0 codec then overrides with it's low-RAM impls. Though, why do
        we need to add the 3 per-type add methods...? Can we somehow use the
        existing add that takes a DocValue...? (We can also fix this up
        separately after committing...).

        The PerDocConsumer.pull seems a bit odd... we need this just because
        we don't know if we are merging .normValues vs .docValues right?
        Maybe rename pull to getDocValuesToMerge?

        This change is curious:

        +            try {
                     assert dir.fileExists(IndexFileNames.segmentFileName(filename, "",
                         Writer.INDEX_EXTENSION));
        +            } catch (IOException e) {
        +            }
        +            break;
        

        ... what motivated that? Does MDW sometimes throw errant exceptions
        in fileExists...? Or some test was failing...?

        Hmm I hit an exc in TestSearcherManager, but it doesn't always
        reproduce:

        NOTE: reproduce with: ant test -Dtestcase=TestSearcherManager -Dtestmethod=testSearcherManager -Dtests.seed=9ddad32eae88e7f:4561a85cf046520e:1f76c8c586f3eb2d -Dargs="-Dfile.encoding=UTF-8"
        ENOTE: test params are: codec=Lucene40: {extra29=PostingsFormat(name=MockSep), extra28=PostingsFormat(name=Lucene40WithOrds), body=MockVariableIntBlock(baseBlockSize=43), extra27=PostingsFormat(name=NestedPulsing), extra26=MockVariableIntBlock(baseBlockSize=43), extra25=MockFixedIntBlock(blockSize=1841), extra24=Lucene40(minBlockSize=42 maxBlockSize=86), extra23=Pulsing40(freqCutoff=13 minBlockSize=3 maxBlockSize=44), extra22=PostingsFormat(name=MockRandom), packID=MockVariableIntBlock(baseBlockSize=43), date=PostingsFormat(name=MockRandom), docid=PostingsFormat(name=MockSep), title=PostingsFormat(name=Lucene40WithOrds), extra20=PostingsFormat(name=Lucene40WithOrds), extra21=PostingsFormat(name=MockSep), extra38=PostingsFormat(name=NestedPulsing), extra8=PostingsFormat(name=MockRandom), extra12=Pulsing40(freqCutoff=13 minBlockSize=3 maxBlockSize=44), extra37=MockVariableIntBlock(baseBlockSize=43), extra11=PostingsFormat(name=MockRandom), extra9=Pulsing40(freqCutoff=13 minBlockSize=3 maxBlockSize=44), extra14=MockFixedIntBlock(blockSize=1841), extra13=Lucene40(minBlockSize=42 maxBlockSize=86), extra39=PostingsFormat(name=Lucene40WithOrds), extra34=Pulsing40(freqCutoff=13 minBlockSize=3 maxBlockSize=44), extra16=PostingsFormat(name=NestedPulsing), extra15=MockVariableIntBlock(baseBlockSize=43), extra33=PostingsFormat(name=MockRandom), extra36=MockFixedIntBlock(blockSize=1841), extra18=PostingsFormat(name=MockSep), extra35=Lucene40(minBlockSize=42 maxBlockSize=86), extra17=PostingsFormat(name=Lucene40WithOrds), extra0=PostingsFormat(name=MockRandom), thisCodeMakesAbsolutelyNoSenseCanWeDeleteIt=PostingsFormat(name=NestedPulsing), extra1=Pulsing40(freqCutoff=13 minBlockSize=3 maxBlockSize=44), extra2=Lucene40(minBlockSize=42 maxBlockSize=86), extra3=MockFixedIntBlock(blockSize=1841), extra5=PostingsFormat(name=NestedPulsing), extra6=PostingsFormat(name=Lucene40WithOrds), extra7=PostingsFormat(name=MockSep), titleTokenized=Pulsing40(freqCutoff=13 minBlockSize=3 maxBlockSize=44), extra30=PostingsFormat(name=NestedPulsing), extra31=PostingsFormat(name=Lucene40WithOrds), extra32=PostingsFormat(name=MockSep), extra10=PostingsFormat(name=MockSep)}, sim=RandomSimilarityProvider(queryNorm=true,coord=true): {extra29=BM25(k1=1.2,b=0.75), extra28=DFR I(ne)L2, body=DFR I(F)2, extra27=DFR I(F)B3(800.0), extra26=DFR I(n)LZ(0.3), extra25=DFR I(n)3(800.0), extra24=DFR GB2, extra23=DFR I(n)Z(0.3), extra22=DFR I(F)L2, packID=IB SPL-D2, date=DFR I(F)1, docid=DFR GL2, title=DFR I(n)2, extra20=DFR GLZ(0.3), extra21=IB SPL-L2, extra38=DefaultSimilarity, extra8=DFR I(ne)2, extra12=DefaultSimilarity, extra11=IB SPL-D2, extra37=IB SPL-D2, extra9=IB LL-L2, extra14=IB SPL-D1, extra13=DFR I(ne)B3(800.0), extra39=DFR I(ne)B3(800.0), extra16=IB SPL-LZ(0.3), extra34=IB SPL-L1, extra15=DFR I(n)L2, extra33=IB LL-L1, extra18=DFR I(ne)B2, extra36=DFR I(ne)L1, extra35=DFR I(ne)1, extra17=DFR I(n)L3(800.0), extra0=IB SPL-D2, extra1=DefaultSimilarity, extra2=DFR I(ne)B3(800.0), extra3=IB SPL-D1, extra5=IB SPL-LZ(0.3), extra6=DFR I(n)L3(800.0), extra7=DFR I(ne)B2, titleTokenized=DFR I(ne)B1, extra30=DFR I(ne)Z(0.3), extra31=DFR I(n)1, extra32=IB LL-D2, extra10=DFR GB3(800.0)}, locale=pt, timezone=VST
        NOTE: all tests run in this JVM:
        [TestShardSearching, TestComplexExplanations, TestTieredMergePolicy, TestLongPostings, TestFieldCacheRewriteMethod, TestSlowCollationMethods, TestDocsAndPositions, TestSearcherManager]
        
        java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still open files: {_6_nrm.cfs=1}
        	at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:545)
        	at org.apache.lucene.index.ThreadedIndexingAndSearchingTestCase.runTest(ThreadedIndexingAndSearchingTestCase.java:629)
        	at org.apache.lucene.search.TestSearcherManager.testSearcherManager(TestSearcherManager.java:52)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at java.lang.reflect.Method.invoke(Method.java:597)
        	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
        	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
        	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
        	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
        	at org.junit.rules.TestWatchman$1.evaluate(TestWatchman.java:48)
        	at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:528)
        	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
        	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
        	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
        	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165)
        	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)
        	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
        	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
        	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
        	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
        	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
        	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
        	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
        	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
        	at org.junit.runners.Suite.runChild(Suite.java:128)
        	at org.junit.runners.Suite.runChild(Suite.java:24)
        	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
        	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
        	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
        	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
        	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
        	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
        	at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
        	at org.junit.runner.JUnitCore.run(JUnitCore.java:136)
        	at org.junit.runner.JUnitCore.run(JUnitCore.java:117)
        	at org.junit.runner.JUnitCore.runMain(JUnitCore.java:98)
        	at org.junit.runner.JUnitCore.runMainAndExit(JUnitCore.java:53)
        	at org.junit.runner.JUnitCore.main(JUnitCore.java:45)
        Caused by: java.lang.RuntimeException: unclosed IndexOutput: _6_nrm.cfs
        	at org.apache.lucene.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:469)
        	at org.apache.lucene.store.MockDirectoryWrapper.createOutput(MockDirectoryWrapper.java:441)
        	at org.apache.lucene.store.CompoundFileWriter.getOutput(CompoundFileWriter.java:124)
        	at org.apache.lucene.store.CompoundFileWriter.createOutput(CompoundFileWriter.java:260)
        	at org.apache.lucene.store.CompoundFileDirectory.createOutput(CompoundFileDirectory.java:290)
        	at org.apache.lucene.codecs.lucene40.values.Bytes$BytesWriterBase.getOrCreateDataOut(Bytes.java:257)
        	at org.apache.lucene.codecs.lucene40.values.FixedStraightBytesImpl$Writer.merge(FixedStraightBytesImpl.java:138)
        	at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:90)
        	at org.apache.lucene.codecs.PerDocConsumer.merge(PerDocConsumer.java:57)
        	at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:391)
        	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:127)
        	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3630)
        	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3258)
        	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:382)
        	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:451)
        
        Show
        Michael McCandless added a comment - Patch looks great! What an awesome step forward... once we later fix sim to be able to do whatever it wants at indexing time (eg, use a 4-byte float), then apps are free to create arbitrary "norms" per doc! Wonderful... It's nice we now have a default merge base impl for DocValues, which the 4.0 codec then overrides with it's low-RAM impls. Though, why do we need to add the 3 per-type add methods...? Can we somehow use the existing add that takes a DocValue...? (We can also fix this up separately after committing...). The PerDocConsumer.pull seems a bit odd... we need this just because we don't know if we are merging .normValues vs .docValues right? Maybe rename pull to getDocValuesToMerge? This change is curious: + try { assert dir.fileExists(IndexFileNames.segmentFileName(filename, "", Writer.INDEX_EXTENSION)); + } catch (IOException e) { + } + break; ... what motivated that? Does MDW sometimes throw errant exceptions in fileExists...? Or some test was failing...? Hmm I hit an exc in TestSearcherManager, but it doesn't always reproduce: NOTE: reproduce with: ant test -Dtestcase=TestSearcherManager -Dtestmethod=testSearcherManager -Dtests.seed=9ddad32eae88e7f:4561a85cf046520e:1f76c8c586f3eb2d -Dargs="-Dfile.encoding=UTF-8" ENOTE: test params are: codec=Lucene40: {extra29=PostingsFormat(name=MockSep), extra28=PostingsFormat(name=Lucene40WithOrds), body=MockVariableIntBlock(baseBlockSize=43), extra27=PostingsFormat(name=NestedPulsing), extra26=MockVariableIntBlock(baseBlockSize=43), extra25=MockFixedIntBlock(blockSize=1841), extra24=Lucene40(minBlockSize=42 maxBlockSize=86), extra23=Pulsing40(freqCutoff=13 minBlockSize=3 maxBlockSize=44), extra22=PostingsFormat(name=MockRandom), packID=MockVariableIntBlock(baseBlockSize=43), date=PostingsFormat(name=MockRandom), docid=PostingsFormat(name=MockSep), title=PostingsFormat(name=Lucene40WithOrds), extra20=PostingsFormat(name=Lucene40WithOrds), extra21=PostingsFormat(name=MockSep), extra38=PostingsFormat(name=NestedPulsing), extra8=PostingsFormat(name=MockRandom), extra12=Pulsing40(freqCutoff=13 minBlockSize=3 maxBlockSize=44), extra37=MockVariableIntBlock(baseBlockSize=43), extra11=PostingsFormat(name=MockRandom), extra9=Pulsing40(freqCutoff=13 minBlockSize=3 maxBlockSize=44), extra14=MockFixedIntBlock(blockSize=1841), extra13=Lucene40(minBlockSize=42 maxBlockSize=86), extra39=PostingsFormat(name=Lucene40WithOrds), extra34=Pulsing40(freqCutoff=13 minBlockSize=3 maxBlockSize=44), extra16=PostingsFormat(name=NestedPulsing), extra15=MockVariableIntBlock(baseBlockSize=43), extra33=PostingsFormat(name=MockRandom), extra36=MockFixedIntBlock(blockSize=1841), extra18=PostingsFormat(name=MockSep), extra35=Lucene40(minBlockSize=42 maxBlockSize=86), extra17=PostingsFormat(name=Lucene40WithOrds), extra0=PostingsFormat(name=MockRandom), thisCodeMakesAbsolutelyNoSenseCanWeDeleteIt=PostingsFormat(name=NestedPulsing), extra1=Pulsing40(freqCutoff=13 minBlockSize=3 maxBlockSize=44), extra2=Lucene40(minBlockSize=42 maxBlockSize=86), extra3=MockFixedIntBlock(blockSize=1841), extra5=PostingsFormat(name=NestedPulsing), extra6=PostingsFormat(name=Lucene40WithOrds), extra7=PostingsFormat(name=MockSep), titleTokenized=Pulsing40(freqCutoff=13 minBlockSize=3 maxBlockSize=44), extra30=PostingsFormat(name=NestedPulsing), extra31=PostingsFormat(name=Lucene40WithOrds), extra32=PostingsFormat(name=MockSep), extra10=PostingsFormat(name=MockSep)}, sim=RandomSimilarityProvider(queryNorm=true,coord=true): {extra29=BM25(k1=1.2,b=0.75), extra28=DFR I(ne)L2, body=DFR I(F)2, extra27=DFR I(F)B3(800.0), extra26=DFR I(n)LZ(0.3), extra25=DFR I(n)3(800.0), extra24=DFR GB2, extra23=DFR I(n)Z(0.3), extra22=DFR I(F)L2, packID=IB SPL-D2, date=DFR I(F)1, docid=DFR GL2, title=DFR I(n)2, extra20=DFR GLZ(0.3), extra21=IB SPL-L2, extra38=DefaultSimilarity, extra8=DFR I(ne)2, extra12=DefaultSimilarity, extra11=IB SPL-D2, extra37=IB SPL-D2, extra9=IB LL-L2, extra14=IB SPL-D1, extra13=DFR I(ne)B3(800.0), extra39=DFR I(ne)B3(800.0), extra16=IB SPL-LZ(0.3), extra34=IB SPL-L1, extra15=DFR I(n)L2, extra33=IB LL-L1, extra18=DFR I(ne)B2, extra36=DFR I(ne)L1, extra35=DFR I(ne)1, extra17=DFR I(n)L3(800.0), extra0=IB SPL-D2, extra1=DefaultSimilarity, extra2=DFR I(ne)B3(800.0), extra3=IB SPL-D1, extra5=IB SPL-LZ(0.3), extra6=DFR I(n)L3(800.0), extra7=DFR I(ne)B2, titleTokenized=DFR I(ne)B1, extra30=DFR I(ne)Z(0.3), extra31=DFR I(n)1, extra32=IB LL-D2, extra10=DFR GB3(800.0)}, locale=pt, timezone=VST NOTE: all tests run in this JVM: [TestShardSearching, TestComplexExplanations, TestTieredMergePolicy, TestLongPostings, TestFieldCacheRewriteMethod, TestSlowCollationMethods, TestDocsAndPositions, TestSearcherManager] java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still open files: {_6_nrm.cfs=1} at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:545) at org.apache.lucene.index.ThreadedIndexingAndSearchingTestCase.runTest(ThreadedIndexingAndSearchingTestCase.java:629) at org.apache.lucene.search.TestSearcherManager.testSearcherManager(TestSearcherManager.java:52) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.rules.TestWatchman$1.evaluate(TestWatchman.java:48) at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:528) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76) at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165) at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31) at org.junit.runners.ParentRunner.run(ParentRunner.java:236) at org.junit.runners.Suite.runChild(Suite.java:128) at org.junit.runners.Suite.runChild(Suite.java:24) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184) at org.junit.runners.ParentRunner.run(ParentRunner.java:236) at org.junit.runner.JUnitCore.run(JUnitCore.java:157) at org.junit.runner.JUnitCore.run(JUnitCore.java:136) at org.junit.runner.JUnitCore.run(JUnitCore.java:117) at org.junit.runner.JUnitCore.runMain(JUnitCore.java:98) at org.junit.runner.JUnitCore.runMainAndExit(JUnitCore.java:53) at org.junit.runner.JUnitCore.main(JUnitCore.java:45) Caused by: java.lang.RuntimeException: unclosed IndexOutput: _6_nrm.cfs at org.apache.lucene.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:469) at org.apache.lucene.store.MockDirectoryWrapper.createOutput(MockDirectoryWrapper.java:441) at org.apache.lucene.store.CompoundFileWriter.getOutput(CompoundFileWriter.java:124) at org.apache.lucene.store.CompoundFileWriter.createOutput(CompoundFileWriter.java:260) at org.apache.lucene.store.CompoundFileDirectory.createOutput(CompoundFileDirectory.java:290) at org.apache.lucene.codecs.lucene40.values.Bytes$BytesWriterBase.getOrCreateDataOut(Bytes.java:257) at org.apache.lucene.codecs.lucene40.values.FixedStraightBytesImpl$Writer.merge(FixedStraightBytesImpl.java:138) at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:90) at org.apache.lucene.codecs.PerDocConsumer.merge(PerDocConsumer.java:57) at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:391) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:127) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3630) at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3258) at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:382) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:451)
        Hide
        Simon Willnauer added a comment -

        next iteration...

        Though, why do
        we need to add the 3 per-type add methods...? Can we somehow use the
        existing add that takes a DocValue...?

        those were pulled up during refactoring. I added them to enable the default merging. This patch marks them as protected so they are really an impl detail and not a public API part.

        The PerDocConsumer.pull seems a bit odd...

        renamed to getDocValuesToMerge

        This change is curious:

        I catch the exception there since its only the assert that triggers this so it should not be part of the method signature. I wrapped it in a RuntimeEx in this patch.

        This patch also reduces the DocValue.java interface to gettters. the setters are really part of the DocValuesField and are not needed inside the implementation. Nobody should modify a DocValue instance. I also added a fix for the exception mike reported. I can't reproduce it locally anymore.

        Show
        Simon Willnauer added a comment - next iteration... Though, why do we need to add the 3 per-type add methods...? Can we somehow use the existing add that takes a DocValue...? those were pulled up during refactoring. I added them to enable the default merging. This patch marks them as protected so they are really an impl detail and not a public API part. The PerDocConsumer.pull seems a bit odd... renamed to getDocValuesToMerge This change is curious: I catch the exception there since its only the assert that triggers this so it should not be part of the method signature. I wrapped it in a RuntimeEx in this patch. This patch also reduces the DocValue.java interface to gettters. the setters are really part of the DocValuesField and are not needed inside the implementation. Nobody should modify a DocValue instance. I also added a fix for the exception mike reported. I can't reproduce it locally anymore.
        Hide
        Michael McCandless added a comment -

        Though, why do we need to add the 3 per-type add methods...? Can we somehow use the existing add that takes a DocValue...?

        those were pulled up during refactoring. I added them to enable the default merging. This patch marks them as protected so they are really an impl detail and not a public API part.

        OK... really, ideally (I think?) the base "merge" method should only
        use the public API (ie the very same API indexing/flushing uses when
        writing a new segment). This is how our other base merge impls
        (stored fields, term vectors, postings) work. But let's open a
        separate issue for this...

        I also added a fix for the exception mike reported. I can't reproduce it locally anymore.

        Looks fixed – I ran for 135 iterations and no fail (previously it
        failed in first 10 or so...).

        Show
        Michael McCandless added a comment - Though, why do we need to add the 3 per-type add methods...? Can we somehow use the existing add that takes a DocValue...? those were pulled up during refactoring. I added them to enable the default merging. This patch marks them as protected so they are really an impl detail and not a public API part. OK... really, ideally (I think?) the base "merge" method should only use the public API (ie the very same API indexing/flushing uses when writing a new segment). This is how our other base merge impls (stored fields, term vectors, postings) work. But let's open a separate issue for this... I also added a fix for the exception mike reported. I can't reproduce it locally anymore. Looks fixed – I ran for 135 iterations and no fail (previously it failed in first 10 or so...).
        Hide
        Simon Willnauer added a comment -

        I am going to commit this soonish if nobody objects

        Show
        Simon Willnauer added a comment - I am going to commit this soonish if nobody objects
        Hide
        Robert Muir added a comment -

        I basically agree with Mike's comments, but we can do this on a followup issue. I think its nice as a step for now that you dont
        have to write merge() to make a DV impl.

        In my opinion the default merge() impl should just form a MultiDocValues of subs and call add() just as flush would, and not
        have "hooks" for subclasses to do bulk merging.
        Subclasses just override the entire merge() to do bulk-merging (or maybe other codec-specific optimizations they have)

        Show
        Robert Muir added a comment - I basically agree with Mike's comments, but we can do this on a followup issue. I think its nice as a step for now that you dont have to write merge() to make a DV impl. In my opinion the default merge() impl should just form a MultiDocValues of subs and call add() just as flush would, and not have "hooks" for subclasses to do bulk merging. Subclasses just override the entire merge() to do bulk-merging (or maybe other codec-specific optimizations they have)
        Hide
        Simon Willnauer added a comment -

        Committed in revision 1227676

        thanks guys

        Show
        Simon Willnauer added a comment - Committed in revision 1227676 thanks guys
        Hide
        Karl Wettin added a comment -

        huzzah!

        Show
        Karl Wettin added a comment - huzzah!

          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Simon Willnauer
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development