  1. Lucene - Core
  2. LUCENE-8662

Change TermsEnum.seekExact(BytesRef) to abstract + delegate seekExact(BytesRef) in FilterLeafReader.FilterTermsEnum


    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.5.5, 6.6.5, 7.6, 8.0
    • Fix Version/s: 8.0
    • Component/s: core/search
    • Labels:
    • Lucene Fields: New

      Description

      Recently, in production, we found that Solr used a lot of memory (more than 10 GB) during recovery or commit for a small index (3.5 GB).
      The stack trace is:

       

      Thread 0x4d4b115c0 
        at org.apache.lucene.store.DataInput.readVInt()I (DataInput.java:125) 
        at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock()V (SegmentTermsEnumFrame.java:157) 
        at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.scanToTermNonLeaf(Lorg/apache/lucene/util/BytesRef;Z)Lorg/apache/lucene/index/TermsEnum$SeekStatus; (SegmentTermsEnumFrame.java:786) 
        at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.scanToTerm(Lorg/apache/lucene/util/BytesRef;Z)Lorg/apache/lucene/index/TermsEnum$SeekStatus; (SegmentTermsEnumFrame.java:538) 
        at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekCeil(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/index/TermsEnum$SeekStatus; (SegmentTermsEnum.java:757) 
        at org.apache.lucene.index.FilterLeafReader$FilterTermsEnum.seekCeil(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/index/TermsEnum$SeekStatus; (FilterLeafReader.java:185) 
        at org.apache.lucene.index.TermsEnum.seekExact(Lorg/apache/lucene/util/BytesRef;)Z (TermsEnum.java:74) 
        at org.apache.solr.search.SolrIndexSearcher.lookupId(Lorg/apache/lucene/util/BytesRef;)J (SolrIndexSearcher.java:823) 
        at org.apache.solr.update.VersionInfo.getVersionFromIndex(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long; (VersionInfo.java:204) 
        at org.apache.solr.update.UpdateLog.lookupVersion(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long; (UpdateLog.java:786) 
        at org.apache.solr.update.VersionInfo.lookupVersion(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long; (VersionInfo.java:194) 
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(Lorg/apache/solr/update/AddUpdateCommand;)Z (DistributedUpdateProcessor.java:1051)  
      

      We reproduced the problem locally with the following code, which uses Lucene directly:

      public static void main(String[] args) throws IOException {
        FSDirectory index = FSDirectory.open(Paths.get("the-index"));
        // ExitableDirectoryReader wraps each TermsEnum in a FilterTermsEnum subclass
        // (see FilterLeafReader$FilterTermsEnum in the stack trace above).
        try (IndexReader reader = new ExitableDirectoryReader(
            DirectoryReader.open(index), new QueryTimeoutImpl(1000 * 60 * 5))) {
          String id = "the-id";

          BytesRef text = new BytesRef(id);
          for (LeafReaderContext lf : reader.leaves()) {
            TermsEnum te = lf.reader().terms("id").iterator();
            System.out.println(te.seekExact(text));
          }
        }
      }
      

       

      To trace the frames being loaded, I added System.out.println("ord: " + ord); to codecs.blocktree.SegmentTermsEnum.getFrame(int).

      See the attached file "output of test program.txt" for the output.

       

      We found the root cause:

      seekExact(BytesRef) is not overridden in FilterLeafReader.FilterTermsEnum, so the call falls through to the base class implementation, TermsEnum.seekExact(BytesRef). That implementation goes through seekCeil, which is very inefficient in this case:

      public boolean seekExact(BytesRef text) throws IOException {
        return seekCeil(text) == SeekStatus.FOUND;
      }
      

      The fix is simple: override seekExact(BytesRef) in FilterLeafReader.FilterTermsEnum so that it delegates to the wrapped enum:

      @Override
      public boolean seekExact(BytesRef text) throws IOException {
        return in.seekExact(text);
      }
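
      The effect of the missing override can be seen without Lucene on the classpath. The following is a minimal sketch, not Lucene code: ToyTermsEnum, CountingEnum, FilterWithoutOverride and FilterWithOverride are hypothetical stand-ins, and terms are plain Strings instead of BytesRef. It only counts which method of the wrapped enum ends up being called.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SeekExactDelegationSketch {

    enum SeekStatus { FOUND, NOT_FOUND, END }

    /** Hypothetical stand-in for TermsEnum: seekExact defaults to seekCeil. */
    static abstract class ToyTermsEnum {
        abstract SeekStatus seekCeil(String text);

        boolean seekExact(String text) {
            return seekCeil(text) == SeekStatus.FOUND;
        }
    }

    /** Hypothetical stand-in for a codec enum that has its own seekExact path. */
    static class CountingEnum extends ToyTermsEnum {
        final List<String> sortedTerms;
        int seekCeilCalls = 0;
        int seekExactCalls = 0;

        CountingEnum(List<String> sortedTerms) {
            this.sortedTerms = sortedTerms;
        }

        @Override
        SeekStatus seekCeil(String text) {
            seekCeilCalls++;
            // Linear scan to the first term >= text.
            for (String term : sortedTerms) {
                int cmp = term.compareTo(text);
                if (cmp == 0) return SeekStatus.FOUND;
                if (cmp > 0) return SeekStatus.NOT_FOUND;
            }
            return SeekStatus.END;
        }

        @Override
        boolean seekExact(String text) {
            seekExactCalls++;
            // Dedicated exact lookup; never needs to locate the next term.
            return Collections.binarySearch(sortedTerms, text) >= 0;
        }
    }

    /** Wrapper that forwards seekCeil only, like the filter before the fix. */
    static class FilterWithoutOverride extends ToyTermsEnum {
        final ToyTermsEnum in;

        FilterWithoutOverride(ToyTermsEnum in) {
            this.in = in;
        }

        @Override
        SeekStatus seekCeil(String text) {
            return in.seekCeil(text);
        }
    }

    /** Wrapper that also forwards seekExact, like the filter after the fix. */
    static class FilterWithOverride extends FilterWithoutOverride {
        FilterWithOverride(ToyTermsEnum in) {
            super(in);
        }

        @Override
        boolean seekExact(String text) {
            return in.seekExact(text);
        }
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "banana", "cherry");

        CountingEnum before = new CountingEnum(terms);
        new FilterWithoutOverride(before).seekExact("banana");
        System.out.println("without override: seekCeil=" + before.seekCeilCalls
            + " seekExact=" + before.seekExactCalls);

        CountingEnum after = new CountingEnum(terms);
        new FilterWithOverride(after).seekExact("banana");
        System.out.println("with override: seekCeil=" + after.seekCeilCalls
            + " seekExact=" + after.seekExactCalls);
    }
}
```

      In this sketch the wrapper without the override never reaches the wrapped enum's seekExact, which mirrors the stack trace above, where TermsEnum.seekExact calls FilterTermsEnum.seekCeil and then SegmentTermsEnum.seekCeil.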
      

        Attachments

        1. output of test program.txt
          1 kB
          jefferyyuan


              People

              • Assignee: Unassigned
              • Reporter: yuanyun.cn jefferyyuan
              • Votes: 0
              • Watchers: 9


                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h