SOLR-5855

re-use document term-vector Fields instance across fields in the DefaultSolrHighlighter

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.0
    • Fix Version/s: 5.2
    • Component/s: highlighter
    • Labels:
      None

      Description

      Hi folks,

      while investigating possible performance bottlenecks in the highlight component I discovered two places where we can save some CPU cycles.

      Both are in the class org.apache.solr.highlight.DefaultSolrHighlighter

      First in method doHighlighting (lines 411-417):
      In the loop we try to highlight every field that has been resolved from the params on each document. OK, but why not skip fields that are not present on the current document?
      So I changed the code from:

      for (String fieldName : fieldNames) {
        fieldName = fieldName.trim();
        if (useFastVectorHighlighter(params, schema, fieldName))
          doHighlightingByFastVectorHighlighter(fvh, fieldQuery, req, docSummaries, docId, doc, fieldName);
        else
          doHighlightingByHighlighter(query, req, docSummaries, docId, doc, fieldName);
      }

      to:

      for (String fieldName : fieldNames) {
        fieldName = fieldName.trim();
        if (doc.get(fieldName) != null) {
          if (useFastVectorHighlighter(params, schema, fieldName))
            doHighlightingByFastVectorHighlighter(fvh, fieldQuery, req, docSummaries, docId, doc, fieldName);
          else
            doHighlightingByHighlighter(query, req, docSummaries, docId, doc, fieldName);
        }
      }
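      The effect of that null check can be sketched with plain collections. This is a hypothetical standalone analogue, not Solr code: the Map stands in for the stored document, and highlightField stands in for the per-field highlighter calls.

      ```java
      import java.util.*;

      public class SkipMissingFields {
          // Stand-in for the per-field highlighting work; counts invocations.
          static int highlightCalls = 0;

          static void highlightField(Map<String, String> doc, String fieldName) {
              highlightCalls++;
              // ... the real code would run the (FastVector)Highlighter here ...
          }

          public static void main(String[] args) {
              // A document that stores only 2 of the 4 requested highlight fields.
              Map<String, String> doc = new HashMap<>();
              doc.put("title", "Solr in Action");
              doc.put("body", "Highlighting test");

              List<String> fieldNames = Arrays.asList("title", "body", "author", "isbn");

              for (String fieldName : fieldNames) {
                  fieldName = fieldName.trim();
                  if (doc.get(fieldName) != null) {   // the proposed short-circuit
                      highlightField(doc, fieldName);
                  }
              }
              // Only the 2 fields present on the document triggered any work.
              System.out.println(highlightCalls);
          }
      }
      ```

      With the short-circuit, absent fields cost only a stored-field lookup instead of a full highlighter invocation.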

      The second place is where we try to retrieve the TokenStream from the document for a specific field.
      line 472:
      TokenStream tvStream = TokenSources.getTokenStreamWithOffsets(searcher.getIndexReader(), docId, fieldName);
      where:

      public static TokenStream getTokenStreamWithOffsets(IndexReader reader, int docId, String field) throws IOException {
        Fields vectors = reader.getTermVectors(docId);
        if (vectors == null) {
          return null;
        }
        Terms vector = vectors.terms(field);
        if (vector == null) {
          return null;
        }
        if (!vector.hasPositions() || !vector.hasOffsets()) {
          return null;
        }
        return getTokenStream(vector);
      }

      Keep in mind that we currently hit the IndexReader n times, where n = requested rows (documents) × requested number of highlight fields.
      In my use case, reader.getTermVectors(docId) takes around 150,000–250,000 ns on a warm Solr and 1,100,000 ns on a cold Solr.

      If we store the returned Fields instance in a cache, the lookup only takes about 25,000 ns.

      I would suggest something like the following code in the doHighlightingByHighlighter method in the DefaultSolrHighlighter class (line 472):
      Fields vectors = null;
      SolrCache termVectorCache = searcher.getCache("termVectorCache");
      if (termVectorCache != null) {
        vectors = (Fields) termVectorCache.get(Integer.valueOf(docId));
        if (vectors == null) {
          vectors = searcher.getIndexReader().getTermVectors(docId);
          if (vectors != null) termVectorCache.put(Integer.valueOf(docId), vectors);
        }
      } else {
        vectors = searcher.getIndexReader().getTermVectors(docId);
      }
      TokenStream tvStream = TokenSources.getTokenStreamWithOffsets(vectors, fieldName);

      and in the TokenSources class:

      public static TokenStream getTokenStreamWithOffsets(Fields vectors, String field) throws IOException {
        if (vectors == null) {
          return null;
        }
        Terms vector = vectors.terms(field);
        if (vector == null) {
          return null;
        }
        if (!vector.hasPositions() || !vector.hasOffsets()) {
          return null;
        }
        return getTokenStream(vector);
      }
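      The caching pattern proposed above can be illustrated without any Lucene dependencies. In this sketch (all names hypothetical), a HashMap plays the role of the SolrCache and a counter plays the role of the expensive reader.getTermVectors(docId) call:

      ```java
      import java.util.*;

      public class TermVectorCacheSketch {
          // Counts simulated IndexReader.getTermVectors(docId) calls.
          static int readerHits = 0;

          // Stand-in for the expensive term-vector lookup.
          static Map<String, String> fetchTermVectors(int docId) {
              readerHits++;
              Map<String, String> vectors = new HashMap<>();
              vectors.put("body", "terms-for-doc-" + docId);
              return vectors;
          }

          public static void main(String[] args) {
              Map<Integer, Map<String, String>> termVectorCache = new HashMap<>();
              int docId = 7;
              String[] highlightFields = {"title", "body", "author"};

              for (String field : highlightFields) {
                  // Proposed pattern: consult the cache before hitting the reader.
                  Map<String, String> vectors = termVectorCache.get(docId);
                  if (vectors == null) {
                      vectors = fetchTermVectors(docId);
                      termVectorCache.put(docId, vectors);
                  }
                  // ... getTokenStreamWithOffsets(vectors, field) would go here ...
              }
              // One reader hit instead of one per highlight field.
              System.out.println(readerHits);
          }
      }
      ```

      The win grows linearly with the number of highlight fields, which matches the n = rows × fields observation above.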

      4000 ms on 1000 docs without cache
      639 ms on 1000 docs with cache

      102 ms on 30 docs without cache
      22 ms on 30 docs with cache

      on an index with 190,000 docs, a numFound of 32,000, and 80 different highlight fields.

      I think queries with only one field to highlight per document don't benefit that much from a cache like this, which is why I think an optional cache would be the best solution here.

      As far as I can see, the FastVectorHighlighter uses more or less the same approach and could also benefit from this cache.

      1. highlight.patch
        5 kB
        Daniel Debray
      2. SOLR-5855_with_FVH_support.patch
        16 kB
        David Smiley
      3. SOLR-5855_with_FVH_support.patch
        15 kB
        David Smiley
      4. SOLR-5855-without-cache.patch
        6 kB
        Thomas Champagne

        Issue Links

          Activity

          Otis Gospodnetic added a comment -

          Nice speed improvement.
          So this adds a new type of cache?
          Is this cache also exposed via JMX and Admin like other caches?

          Daniel Debray added a comment -

          Sure, this cache should be registered like the other Solr caches and is therefore exposed via JMX too.

          Daniel Debray added a comment -

          I did a fork on GitHub and added the changes plus tests. The cache has been used in our environment for ~3 months now without problems. The only thing is that this cache has the same limitations as the document cache, so no autowarming is available.

          If you think it's fine, I would like to create a pull request or update the attached patch.

          https://github.com/DDebray/lucene-solr

          Thomas Champagne added a comment -

          I think this issue should be split into two issues.

          The first optimization is very simple and can be integrated quickly; a new issue could be created with that small patch.

          The second optimization is more complicated, but I think it is possible to solve the problem without a cache. You say that the problem is the call to searcher.getIndexReader().getTermVectors(docId) for each field. I think you can move this before the loop over the fields in the doHighlighting method and fetch the term vectors only once per document.

          I'll try to create a patch, but I don't have time to do this right now.

          Thomas Champagne added a comment -

          I created a patch with the two optimizations, based on branch_5x.

          This patch doesn't use a cache. I moved the call to searcher.getIndexReader().getTermVectors(docId) before the loop over the fields, so the term vectors are fetched only once per document.

          HighlightingByFastVector doesn't benefit from the change, but I think it will be possible to change this.

          The patch includes a new unit test for this feature: create 20 docs with 80 fields (half null, half with a value) and run 10 queries with hl.fl=*.
          Running the test without the patch: ~6 sec
          Running the test with the patch: ~3 sec

          Let me know your opinion on this small patch; I think it is simpler than the caching approach.
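          The hoisting idea described here can be sketched in isolation (hypothetical names, no Lucene dependencies): the expensive per-document fetch moves out of the per-field loop, so it runs once per document instead of once per document-field pair.

          ```java
          import java.util.*;

          public class HoistTermVectorFetch {
              // Counts simulated IndexReader.getTermVectors(docId) calls.
              static int readerHits = 0;
              static int fieldsHighlighted = 0;

              // Stand-in for the expensive term-vector fetch.
              static Map<String, String> getTermVectors(int docId) {
                  readerHits++;
                  return Collections.singletonMap("body", "terms-for-doc-" + docId);
              }

              static void highlightDoc(int docId, List<String> fields) {
                  // Fetch once per document, BEFORE the per-field loop.
                  Map<String, String> vectors = getTermVectors(docId);
                  for (String field : fields) {
                      fieldsHighlighted++;
                      // ... per-field highlighting would use vectors.terms(field) here ...
                  }
              }

              public static void main(String[] args) {
                  List<String> fields = Arrays.asList("title", "body", "author", "isbn");
                  for (int docId = 0; docId < 20; docId++) {
                      highlightDoc(docId, fields);
                  }
                  // 20 fetches (one per doc) instead of 20 * 4 (one per doc-field pair).
                  System.out.println(readerHits);
              }
          }
          ```

          Unlike a SolrCache, this keeps no state between requests, which is why it sidesteps the cache-configuration questions raised earlier in the thread.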

          Thomas Champagne added a comment -

          I tested the patch again with 100 queries (20 docs with 80 fields, half null, half with a value, and 100 queries with hl.fl=*):
          Without patch: ~186 sec
          With patch: ~72 sec

          David Smiley added a comment -

          Nice work! I'll put some attention on this to try and get it in; not necessarily today/tomorrow but should make it for 5.2. Ideally the FVH should be supported as well; it'd be a shame to do one but not the other.

          A SolrCache of term vectors should definitely be a separate issue.

          David Smiley added a comment -

          Another thing that should be done is to figure out how to avoid grabbing the term vector Fields altogether if none of the fields to highlight have term vectors in the first place.

          David Smiley added a comment -

          Attached is a patch addressing the issue, not just for the default/standard Highlighter but for the FVH as well. For the FVH case, I use a FilteredDirectoryReader/LeafReader that always returns a pre-fetched Fields instance. What is not in this patch, but which I did see in Daniel's patch, is a short-circuit optimization for when the current document doesn't have a value for the field being highlighted. I like the idea, but it defeats the ability to subclass this highlighter and override getFieldValues to get the field values from some other source (e.g. from a different field). I'm currently doing that in some client work.

          The patch also includes a refactoring that isn't strictly related to this. I changed the internal API contract of doHighlightingByHighlighter and doHighlightingByFastVectorHighlighter and alternateField to not be responsible for populating the NamedList of highlights for the document; instead they return an object and the caller places it on the NL.

          Please let me know what you think Daniel.

          David Smiley added a comment -

          This is an improvement over my patch from 30 minutes ago. This one uses a term-vector-caching IndexReader wrapper for both the FVH and standard highlighter code paths, which is more congruent. It also means that the term vector fetch won't even happen if there is no stored text to highlight, since both highlighters try to get the stored text first.

          I also tweaked some internal methods to be a little more consistent. Tests pass.
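          The wrapper idea can be sketched as a small decorator (hypothetical names, no Lucene dependencies): it exposes the same lookup as the underlying reader, but memoizes the result, so the real fetch happens at most once and only if some code path actually asks for it.

          ```java
          import java.util.*;
          import java.util.function.IntFunction;

          public class CachingReaderWrapperSketch {
              // Counts calls that reach the "real" reader.
              static int underlyingFetches = 0;

              // Stand-in for the real reader's getTermVectors(docId).
              static Map<String, String> slowGetTermVectors(int docId) {
                  underlyingFetches++;
                  return Collections.singletonMap("body", "terms-" + docId);
              }

              // Decorator in the spirit of the patch: same interface as the
              // delegate, lazily caching one document's term vectors.
              static class CachingReader {
                  private final IntFunction<Map<String, String>> delegate;
                  private Integer cachedDocId;
                  private Map<String, String> cachedVectors;

                  CachingReader(IntFunction<Map<String, String>> delegate) {
                      this.delegate = delegate;
                  }

                  Map<String, String> getTermVectors(int docId) {
                      if (cachedDocId == null || cachedDocId != docId) {
                          cachedVectors = delegate.apply(docId); // fetched only on demand
                          cachedDocId = docId;
                      }
                      return cachedVectors;
                  }
              }

              public static void main(String[] args) {
                  CachingReader reader =
                      new CachingReader(CachingReaderWrapperSketch::slowGetTermVectors);
                  // Both highlighter code paths would ask the same wrapper;
                  // only one real fetch happens for repeated requests.
                  reader.getTermVectors(3);
                  reader.getTermVectors(3);
                  reader.getTermVectors(3);
                  System.out.println(underlyingFetches);
              }
          }
          ```

          Because the fetch is lazy, a document with no highlightable stored text never pays for a term-vector lookup at all, matching the behavior described above.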

          ASF subversion and git services added a comment -

          Commit 1680871 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1680871 ]

          SOLR-5855: Re-use the document's term vectors in DefaultSolrHighlighter.
          Also refactored DefaultSolrHighlighter's methods to be a little nicer.

          ASF subversion and git services added a comment -

          Commit 1680872 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1680872 ]

          SOLR-5855: Re-use the document's term vectors in DefaultSolrHighlighter. Also refactored DefaultSolrHighlighter's methods to be a little nicer.

          David Smiley added a comment -

          Thanks for finding the problem and the initial patch, Daniel Debray. It would be great if those who have benchmarked could try again with this patch (or by pulling from branch 5x since it's committed) – just to be sure it's working well. The 5.2 release branch is going to be cut later today.

          Ere Maijala added a comment -

          I was hoping, perhaps naively, that this would help with the highlighter performance problems we're having with Solr 5. Unfortunately it doesn't seem to have made a difference. Using hl.usePhraseHighlighter=false has a significant effect, but obviously with downsides, and it is still much slower than 4.10.2.

          For what it's worth, here is some additional information:

          Timing from Solr 4.10.2 (42.5 million records):

          "process": {
          "time": 1711,
          "query":

          { "time": 0 }

          ,
          "facet":

          { "time": 66 }

          ,
          "mlt":

          { "time": 0 }

          ,
          "highlight":

          { "time": 708 }

          ,
          "stats":

          { "time": 0 }

          ,
          "expand":

          { "time": 0 }

          ,
          "spellcheck":

          { "time": 433 }

          ,
          "debug":

          { "time": 503 }

          }

          Timing from Solr 5.2.0 (38.8 million records):

          "process": {
          "time": 10172,
          "query":

          { "time": 0 }

          ,
          "facet":

          { "time": 45 }

          ,
          "facet_module":

          { "time": 0 }

          ,
          "mlt":

          { "time": 0 }

          ,
          "highlight":

          { "time": 9310 }

          ,
          "stats":

          { "time": 0 }

          ,
          "expand":

          { "time": 0 }

          ,
          "spellcheck":

          { "time": 345 }

          ,
          "debug":

          { "time": 472 }

          }

          A couple of jstack outputs during the query execution are here: http://pastebin.com/8FJiq5R3. The schema and solrconfig are at https://github.com/NatLibFi/NDL-VuFind-Solr/tree/master/vufind/biblio/conf.

          David Smiley added a comment -

          I was initially skeptical that the stack traces would show anything of interest, but I am pleasantly mistaken. Apparently, getting the FieldInfos from SlowCompositeReaderWrapper is a bottleneck. We look this up to determine whether there are payloads, so that we can tell MemoryIndex to capture them as well. FYI, that call was added recently in SOLR-6916 (highlighting using payloads); it's not related to term vectors, the subject of this issue.

          Can you please download the 5x branch, comment out the scorer.getUsePayloads(... line (or set it to true if you want), and see how it performs?

          David Smiley added a comment -

          Ere Maijala I created an issue for this; please discuss further there: SOLR-7655

          Anshum Gupta added a comment -

          Bulk close for 5.2.0.


            People

            • Assignee:
              David Smiley
              Reporter:
              Daniel Debray
            • Votes:
              2
              Watchers:
              10
