Lucene - Core / LUCENE-6065

remove "foreign readers" from merge, fix LeafReader instead.

Details

    • Task
    • Status: Open
    • Major
    • Resolution: Unresolved

    Description

      Currently, SegmentMerger supports two classes of citizens when merging:

      1. SegmentReader
      2. "foreign reader" (e.g. some FilterReader)

      It does an instanceof check and executes the merge differently. In the SegmentReader case, stored fields and term vectors are bulk-merged, norms and docvalues are transferred directly without piling up on the heap, CRC32 verification runs with IO locality of the data being merged, etc. Otherwise, we treat it as a "foreign" reader and it's slow.
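
      A minimal sketch of why that check hurts, using hypothetical stand-in types (LeafLike, SegmentLike, FilterLike are illustrative names, not Lucene classes, and this is not SegmentMerger's actual code): as soon as a segment is wrapped, even by a wrapper that changes nothing, the instanceof test fails and the slow path is taken.

```java
// Hypothetical stand-ins; none of these are real Lucene classes.
interface LeafLike {
    int maxDoc();
}

// Plays the role of the concrete segment-backed reader.
final class SegmentLike implements LeafLike {
    private final int maxDoc;
    SegmentLike(int maxDoc) { this.maxDoc = maxDoc; }
    public int maxDoc() { return maxDoc; }
}

// A wrapper that changes nothing, similar in spirit to a FilterReader.
final class FilterLike implements LeafLike {
    private final LeafLike in;
    FilterLike(LeafLike in) { this.in = in; }
    public int maxDoc() { return in.maxDoc(); }
}

final class MergeDispatchSketch {
    // Mirrors the instanceof check described above: only the concrete
    // segment type is routed to the optimized path.
    static String choosePath(LeafLike reader) {
        if (reader instanceof SegmentLike) {
            return "bulk";
        }
        return "slow";
    }

    public static void main(String[] args) {
        LeafLike segment = new SegmentLike(100);
        LeafLike wrapped = new FilterLike(segment);
        System.out.println(choosePath(segment));
        System.out.println(choosePath(wrapped));
    }
}
```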

      This is just the low level; it gets worse as you wrap with more stuff. A great example is SortingMergePolicy: not only does it have the low-level slowdowns listed above, it will also e.g. cache/pile up OrdinalMaps for all string docvalues fields being merged, plus other silliness that just makes matters worse.

      Another use case is 5.0 users wishing to upgrade from fieldcache to docvalues. This should be possible to implement with a simple incremental transition based on a mergepolicy that uses UninvertingReader. But we shouldn't populate internal fieldcache entries unnecessarily on merge and spike RAM until all those segment cores are released, and other things like bulk merge of stored fields and not piling up norms should still work: they're completely unrelated.

      There are more problems we can fix if we clean this up: checkindex/checkreader can run efficiently without spiking RAM the way merging does, we can remove the checkIntegrity() method completely from LeafReader, since it can always be accomplished on producers, etc. In general it would be nice to just have one codepath for merging that is as efficient as we can make it, and to support things like index modifications during merge.

      I spent a few weeks writing 3 different implementations to fix this (interface, optional abstract class, "fix LeafReader"), and the last is the only one I don't completely hate: I think our APIs should be efficient for indexing as well as search.

      So the proposal is simple: refactor LeafReader to require just the producer APIs as abstract methods (and FilterReaders should work on that). The search-oriented APIs can then be final methods that defer to those.

      So we would add 5 abstract methods, but implement 10 current methods as final based on those, and then merging would always be efficient.

        // new abstract codec-based apis
        /** 
         * Expert: retrieve thread-private TermVectorsReader
         * @throws AlreadyClosedException if this reader is closed
         * @lucene.internal 
         */
        protected abstract TermVectorsReader getTermVectorsReader();
      
        /** 
         * Expert: retrieve thread-private StoredFieldsReader
         * @throws AlreadyClosedException if this reader is closed
         * @lucene.internal 
         */
        protected abstract StoredFieldsReader getFieldsReader();
        
        /** 
         * Expert: retrieve underlying NormsProducer
         * @throws AlreadyClosedException if this reader is closed
         * @lucene.internal 
         */
        protected abstract NormsProducer getNormsReader();
        
        /** 
         * Expert: retrieve underlying DocValuesProducer
         * @throws AlreadyClosedException if this reader is closed
         * @lucene.internal 
         */
        protected abstract DocValuesProducer getDocValuesReader();
        
        /** 
         * Expert: retrieve underlying FieldsProducer
         * @throws AlreadyClosedException if this reader is closed
         * @lucene.internal  
         */
        protected abstract FieldsProducer getPostingsReader();
      
        // user/search oriented public apis based on the above
        public final Fields fields();
        public final void document(int, StoredFieldVisitor);
        public final Fields getTermVectors(int);
        public final NumericDocValues getNumericDocValues(String);
        public final Bits getDocsWithField(String);
        public final BinaryDocValues getBinaryDocValues(String);
        public final SortedDocValues getSortedDocValues(String);
        public final SortedNumericDocValues getSortedNumericDocValues(String);
        public final SortedSetDocValues getSortedSetDocValues(String);
        public final NumericDocValues getNormValues(String);
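
      The shape of this proposal can be sketched with a toy model, limited to stored fields. Every type here (StoredFieldsSource, LeafSketch, SegmentSketch, UpperCaseFilterSketch, MergerSketch) is a hypothetical stand-in, not Lucene's real API or the attached patch: the base class exposes one abstract producer getter, the search-facing method is final and defers to it, a filter wraps at the producer level, and the merger uses a single code path with no instanceof.

```java
// Hypothetical stand-ins; these are not Lucene's real types or signatures.
interface StoredFieldsSource {           // plays the role of a codec producer
    String document(int docID);
    void checkIntegrity();
}

abstract class LeafSketch {
    // Producer-oriented API: the only thing subclasses implement.
    protected abstract StoredFieldsSource getFieldsReader();

    // Search-oriented API: final, always defers to the producer.
    public final String document(int docID) {
        return getFieldsReader().document(docID);
    }
}

final class SegmentSketch extends LeafSketch {
    private final String[] docs;
    SegmentSketch(String... docs) { this.docs = docs; }

    @Override
    protected StoredFieldsSource getFieldsReader() {
        return new StoredFieldsSource() {
            public String document(int docID) { return docs[docID]; }
            public void checkIntegrity() { /* a real producer would verify checksums here */ }
        };
    }
}

// A filter wraps the producer, so search and merge both see its changes.
final class UpperCaseFilterSketch extends LeafSketch {
    private final LeafSketch in;
    UpperCaseFilterSketch(LeafSketch in) { this.in = in; }

    @Override
    protected StoredFieldsSource getFieldsReader() {
        final StoredFieldsSource delegate = in.getFieldsReader();
        return new StoredFieldsSource() {
            public String document(int docID) {
                return delegate.document(docID).toUpperCase(java.util.Locale.ROOT);
            }
            public void checkIntegrity() { delegate.checkIntegrity(); }
        };
    }
}

final class MergerSketch {
    // One code path: every reader hands over a producer, wrapped or not.
    static String[] merge(LeafSketch reader, int maxDoc) {
        StoredFieldsSource source = reader.getFieldsReader();
        source.checkIntegrity();          // integrity check lives on the producer
        String[] merged = new String[maxDoc];
        for (int i = 0; i < maxDoc; i++) {
            merged[i] = source.document(i);
        }
        return merged;
    }

    public static void main(String[] args) {
        LeafSketch filtered = new UpperCaseFilterSketch(new SegmentSketch("a", "b"));
        for (String doc : merge(filtered, 2)) {
            System.out.println(doc);
        }
    }
}
```

      In this toy model the integrity check is called on the producer rather than on the reader, which is the reason the description gives for dropping checkIntegrity() from LeafReader.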
      

      Attachments

        1. LUCENE-6065.patch
          66 kB
          Robert Muir

          People

            Unassigned
            Robert Muir (rcmuir)