SOLR-5855

re-use document term-vector Fields instance across fields in the DefaultSolrHighlighter

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.0
    • Fix Version/s: 5.2
    • Component/s: highlighter
    • Labels:
      None

      Description

      Hi folks,

      while investigating possible performance bottlenecks in the highlight component I discovered two places where we can save some CPU cycles.

      Both are in the class org.apache.solr.highlight.DefaultSolrHighlighter

      First in method doHighlighting (lines 411-417):
      In the loop we try to highlight every field that has been resolved from the params on each document. OK, but why not skip fields that are not present on the current document?
      So I changed the code from:

      for (String fieldName : fieldNames) {
        fieldName = fieldName.trim();
        if (useFastVectorHighlighter(params, schema, fieldName))
          doHighlightingByFastVectorHighlighter(fvh, fieldQuery, req, docSummaries, docId, doc, fieldName);
        else
          doHighlightingByHighlighter(query, req, docSummaries, docId, doc, fieldName);
      }

      to:

      for (String fieldName : fieldNames) {
        fieldName = fieldName.trim();
        if (doc.get(fieldName) != null) {
          if (useFastVectorHighlighter(params, schema, fieldName))
            doHighlightingByFastVectorHighlighter(fvh, fieldQuery, req, docSummaries, docId, doc, fieldName);
          else
            doHighlightingByHighlighter(query, req, docSummaries, docId, doc, fieldName);
        }
      }
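      The effect of that null check can be sketched with plain collections. This is a hypothetical standalone analogue, not Solr code: the Map stands in for the stored document, and highlightField stands in for the per-field highlighter calls.

      ```java
      import java.util.*;

      public class SkipMissingFields {
          // Stand-in for the per-field highlighting work; counts invocations.
          static int highlightCalls = 0;

          static void highlightField(Map<String, String> doc, String fieldName) {
              highlightCalls++;
              // ... the real code would run the (FastVector)Highlighter here ...
          }

          public static void main(String[] args) {
              // A document that stores only 2 of the 4 requested highlight fields.
              Map<String, String> doc = new HashMap<>();
              doc.put("title", "Solr in Action");
              doc.put("body", "Highlighting test");

              List<String> fieldNames = Arrays.asList("title", "body", "author", "isbn");

              for (String fieldName : fieldNames) {
                  fieldName = fieldName.trim();
                  if (doc.get(fieldName) != null) {   // the proposed short-circuit
                      highlightField(doc, fieldName);
                  }
              }
              // Only the 2 fields present on the document triggered any work.
              System.out.println(highlightCalls);
          }
      }
      ```

      With the short-circuit, absent fields cost only a stored-field lookup instead of a full highlighter invocation.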

      The second place is where we try to retrieve the TokenStream from the document for a specific field.
      line 472:
      TokenStream tvStream = TokenSources.getTokenStreamWithOffsets(searcher.getIndexReader(), docId, fieldName);
      where:

      public static TokenStream getTokenStreamWithOffsets(IndexReader reader, int docId, String field) throws IOException {
        Fields vectors = reader.getTermVectors(docId);
        if (vectors == null) {
          return null;
        }
        Terms vector = vectors.terms(field);
        if (vector == null) {
          return null;
        }
        if (!vector.hasPositions() || !vector.hasOffsets()) {
          return null;
        }
        return getTokenStream(vector);
      }

      Keep in mind that we currently hit the IndexReader n times, where n = requested rows (documents) × requested number of highlight fields.
      In my use case, reader.getTermVectors(docId) takes around 150,000–250,000 ns on a warm Solr and 1,100,000 ns on a cold Solr.

      If we store the returned Fields instance in a cache, the lookup only takes about 25,000 ns.

      I would suggest something like the following code in the doHighlightingByHighlighter method in the DefaultSolrHighlighter class (line 472):
      Fields vectors = null;
      SolrCache termVectorCache = searcher.getCache("termVectorCache");
      if (termVectorCache != null) {
        vectors = (Fields) termVectorCache.get(Integer.valueOf(docId));
        if (vectors == null) {
          vectors = searcher.getIndexReader().getTermVectors(docId);
          if (vectors != null) termVectorCache.put(Integer.valueOf(docId), vectors);
        }
      } else {
        vectors = searcher.getIndexReader().getTermVectors(docId);
      }
      TokenStream tvStream = TokenSources.getTokenStreamWithOffsets(vectors, fieldName);

      and in the TokenSources class:

      public static TokenStream getTokenStreamWithOffsets(Fields vectors, String field) throws IOException {
        if (vectors == null) {
          return null;
        }
        Terms vector = vectors.terms(field);
        if (vector == null) {
          return null;
        }
        if (!vector.hasPositions() || !vector.hasOffsets()) {
          return null;
        }
        return getTokenStream(vector);
      }
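      The caching pattern proposed above can be illustrated without any Lucene dependencies. In this sketch (all names hypothetical), a HashMap plays the role of the SolrCache and a counter plays the role of the expensive reader.getTermVectors(docId) call:

      ```java
      import java.util.*;

      public class TermVectorCacheSketch {
          // Counts simulated IndexReader.getTermVectors(docId) calls.
          static int readerHits = 0;

          // Stand-in for the expensive term-vector lookup.
          static Map<String, String> fetchTermVectors(int docId) {
              readerHits++;
              Map<String, String> vectors = new HashMap<>();
              vectors.put("body", "terms-for-doc-" + docId);
              return vectors;
          }

          public static void main(String[] args) {
              Map<Integer, Map<String, String>> termVectorCache = new HashMap<>();
              int docId = 7;
              String[] highlightFields = {"title", "body", "author"};

              for (String field : highlightFields) {
                  // Proposed pattern: consult the cache before hitting the reader.
                  Map<String, String> vectors = termVectorCache.get(docId);
                  if (vectors == null) {
                      vectors = fetchTermVectors(docId);
                      termVectorCache.put(docId, vectors);
                  }
                  // ... getTokenStreamWithOffsets(vectors, field) would go here ...
              }
              // One reader hit instead of one per highlight field.
              System.out.println(readerHits);
          }
      }
      ```

      The win grows linearly with the number of highlight fields, which matches the n = rows × fields observation above.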

      4000 ms on 1000 docs without cache
      639 ms on 1000 docs with cache

      102 ms on 30 docs without cache
      22 ms on 30 docs with cache

      on an index with 190,000 docs, a numFound of 32,000, and 80 different highlight fields.

      I think queries with only one field to highlight per document don't benefit that much from a cache like this, which is why I think an optional cache would be the best solution here.

      As far as I can see, the FastVectorHighlighter uses more or less the same approach and could also benefit from this cache.

      1. highlight.patch
        5 kB
        Daniel Debray
      2. SOLR-5855_with_FVH_support.patch
        16 kB
        David Smiley
      3. SOLR-5855_with_FVH_support.patch
        15 kB
        David Smiley
      4. SOLR-5855-without-cache.patch
        6 kB
        Thomas Champagne

        Issue Links

          Activity

          Otis Gospodnetic added a comment -

          Nice speed improvement.
          So this adds a new type of cache?
          Is this cache also exposed via JMX and Admin like other caches?

          Daniel Debray added a comment -

          Sure, this cache should be registered like the other Solr caches and is therefore exposed via JMX too.

          Daniel Debray added a comment -

          I did a fork on GitHub and added the changes plus tests. The cache has been used in our environment for ~3 months now without problems. The only thing is that this cache has the same limitations as the document cache, so no autowarming is available.

          If you think it's fine, I would like to create a pull request or update the attached patch.

          https://github.com/DDebray/lucene-solr

          Thomas Champagne added a comment -

          I think this issue should be split into two issues.

          The first optimization is very simple and can be integrated quickly; a new issue could be created with that small patch.

          The second optimization is more complicated, but I think it is possible to solve the problem without a cache. You say that the problem is the call to searcher.getIndexReader().getTermVectors(docId) for each field. I think you can move this before the loop over the fields in the doHighlighting method and fetch the term vectors only once per document.

          I'll try to create a patch, but I don't have time to do this right now.

          Thomas Champagne added a comment -

          I created a patch with the two optimizations, based on branch_5x.

          This patch doesn't use a cache. I moved the call to searcher.getIndexReader().getTermVectors(docId) before the loop over the fields, so the term vectors are fetched only once per document.

          HighlightingByFastVector doesn't benefit from the change, but I think it will be possible to change this.

          The patch includes a new unit test for this feature: create 20 docs with 80 fields (half null, half with a value) and run 10 queries with hl.fl=*.
          Running the test without the patch: ~6 sec
          Running the test with the patch: ~3 sec

          Let me know your opinion on this small patch; I think it is simpler than the caching approach.
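          The hoisting idea described here can be sketched in isolation (hypothetical names, no Lucene dependencies): the expensive per-document fetch moves out of the per-field loop, so it runs once per document instead of once per document-field pair.

          ```java
          import java.util.*;

          public class HoistTermVectorFetch {
              // Counts simulated IndexReader.getTermVectors(docId) calls.
              static int readerHits = 0;
              static int fieldsHighlighted = 0;

              // Stand-in for the expensive term-vector fetch.
              static Map<String, String> getTermVectors(int docId) {
                  readerHits++;
                  return Collections.singletonMap("body", "terms-for-doc-" + docId);
              }

              static void highlightDoc(int docId, List<String> fields) {
                  // Fetch once per document, BEFORE the per-field loop.
                  Map<String, String> vectors = getTermVectors(docId);
                  for (String field : fields) {
                      fieldsHighlighted++;
                      // ... per-field highlighting would use vectors.terms(field) here ...
                  }
              }

              public static void main(String[] args) {
                  List<String> fields = Arrays.asList("title", "body", "author", "isbn");
                  for (int docId = 0; docId < 20; docId++) {
                      highlightDoc(docId, fields);
                  }
                  // 20 fetches (one per doc) instead of 20 * 4 (one per doc-field pair).
                  System.out.println(readerHits);
              }
          }
          ```

          Unlike a SolrCache, this keeps no state between requests, which is why it sidesteps the cache-configuration questions raised earlier in the thread.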

          Thomas Champagne added a comment -

          I tested the patch again with 100 queries (20 docs with 80 fields, half null, half with a value, and 100 queries with hl.fl=*):
          Without patch: ~186 sec
          With patch: ~72 sec

          David Smiley added a comment -

          Nice work! I'll put some attention on this to try and get it in; not necessarily today/tomorrow but should make it for 5.2. Ideally the FVH should be supported as well; it'd be a shame to do one but not the other.

          A SolrCache of term vectors should definitely be a separate issue.

          David Smiley added a comment -

          Another thing that should be done is to figure out how to avoid grabbing the term vector Fields altogether if none of the fields to highlight have term vectors in the first place.

          David Smiley added a comment -

          Attached is a patch addressing the issue, not just for the default/standard Highlighter but for the FVH as well. For the FVH case, I use a FilteredDirectoryReader/LeafReader that always returns a pre-fetched Fields instance. What is not in this patch, but which I did see in Daniel's patch, is a short-circuit optimization for when the current document doesn't have a value for the field being highlighted. I like the idea, but it defeats the ability to subclass this highlighter and override getFieldValues to get the field values from some other source (e.g. from a different field). I'm currently doing that in some client work.

          The patch also includes a refactoring that isn't strictly related to this. I changed the internal API contract of doHighlightingByHighlighter and doHighlightingByFastVectorHighlighter and alternateField to not be responsible for populating the NamedList of highlights for the document; instead they return an object and the caller places it on the NL.

          Please let me know what you think Daniel.

          David Smiley added a comment -

          This is an improvement over my patch from 30 minutes ago. This one uses a term-vector-caching IndexReader wrapper for both the FVH and standard highlighter code paths, which is more congruent. It also means that the term vector fetch won't even happen if there is no stored text to highlight, since both highlighters try to get the stored text first.

          I also tweaked some internal methods to be a little more consistent. Tests pass.
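          The wrapper idea can be sketched as a small decorator (hypothetical names, no Lucene dependencies): it exposes the same lookup as the underlying reader, but memoizes the result, so the real fetch happens at most once and only if some code path actually asks for it.

          ```java
          import java.util.*;
          import java.util.function.IntFunction;

          public class CachingReaderWrapperSketch {
              // Counts calls that reach the "real" reader.
              static int underlyingFetches = 0;

              // Stand-in for the real reader's getTermVectors(docId).
              static Map<String, String> slowGetTermVectors(int docId) {
                  underlyingFetches++;
                  return Collections.singletonMap("body", "terms-" + docId);
              }

              // Decorator in the spirit of the patch: same interface as the
              // delegate, lazily caching one document's term vectors.
              static class CachingReader {
                  private final IntFunction<Map<String, String>> delegate;
                  private Integer cachedDocId;
                  private Map<String, String> cachedVectors;

                  CachingReader(IntFunction<Map<String, String>> delegate) {
                      this.delegate = delegate;
                  }

                  Map<String, String> getTermVectors(int docId) {
                      if (cachedDocId == null || cachedDocId != docId) {
                          cachedVectors = delegate.apply(docId); // fetched only on demand
                          cachedDocId = docId;
                      }
                      return cachedVectors;
                  }
              }

              public static void main(String[] args) {
                  CachingReader reader =
                      new CachingReader(CachingReaderWrapperSketch::slowGetTermVectors);
                  // Both highlighter code paths would ask the same wrapper;
                  // only one real fetch happens for repeated requests.
                  reader.getTermVectors(3);
                  reader.getTermVectors(3);
                  reader.getTermVectors(3);
                  System.out.println(underlyingFetches);
              }
          }
          ```

          Because the fetch is lazy, a document with no highlightable stored text never pays for a term-vector lookup at all, matching the behavior described above.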

          ASF subversion and git services added a comment -

          Commit 1680871 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1680871 ]

          SOLR-5855: Re-use the document's term vectors in DefaultSolrHighlighter.
          Also refactored DefaultSolrHighlighter's methods to be a little nicer.

          ASF subversion and git services added a comment -

          Commit 1680872 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1680872 ]

          SOLR-5855: Re-use the document's term vectors in DefaultSolrHighlighter. Also refactored DefaultSolrHighlighter's methods to be a little nicer.

          David Smiley added a comment -

          Thanks for finding the problem and the initial patch, Daniel Debray. It would be great if those who have benchmarked could try again with this patch (or by pulling from branch 5x since it's committed) – just to be sure it's working well. The 5.2 release branch is going to be cut later today.

          Ere Maijala added a comment -

          I was hoping, perhaps naively, that this would help with the highlighter performance problems we're having with Solr 5. Unfortunately it doesn't seem to have made a difference. Using hl.usePhraseHighlighter=false has a significant effect, but obviously with downsides, and it is still much slower than 4.10.2.

          For what it's worth, here is some additional information:

          Timing from Solr 4.10.2 (42.5 million records):

          "process": {
          "time": 1711,
          "query":

          { "time": 0 }

          ,
          "facet":

          { "time": 66 }

          ,
          "mlt":

          { "time": 0 }

          ,
          "highlight":

          { "time": 708 }

          ,
          "stats":

          { "time": 0 }

          ,
          "expand":

          { "time": 0 }

          ,
          "spellcheck":

          { "time": 433 }

          ,
          "debug":

          { "time": 503 }

          }

          Timing from Solr 5.2.0 (38.8 million records):

          "process": {
          "time": 10172,
          "query":

          { "time": 0 }

          ,
          "facet":

          { "time": 45 }

          ,
          "facet_module":

          { "time": 0 }

          ,
          "mlt":

          { "time": 0 }

          ,
          "highlight":

          { "time": 9310 }

          ,
          "stats":

          { "time": 0 }

          ,
          "expand":

          { "time": 0 }

          ,
          "spellcheck":

          { "time": 345 }

          ,
          "debug":

          { "time": 472 }

          }

          A couple of jstack outputs during the query execution are here: http://pastebin.com/8FJiq5R3. The schema and solrconfig are at https://github.com/NatLibFi/NDL-VuFind-Solr/tree/master/vufind/biblio/conf.

          David Smiley added a comment -

          I was initially skeptical that the stack traces would show anything of interest, but I am pleasantly mistaken. Apparently, getting the FieldInfos from SlowCompositeReaderWrapper is a bottleneck. We look this up to determine whether there are payloads, so that we can tell MemoryIndex to capture them as well. FYI, that call was added recently in SOLR-6916 (highlighting using payloads); it's not related to term vectors, the subject of this issue.

          Can you please download the 5x branch, comment out the scorer.getUsePayloads(... line (or set it to true if you want), and see how it performs?

          David Smiley added a comment -

          Ere Maijala I created an issue for this; please discuss further there: SOLR-7655

          Anshum Gupta added a comment -

          Bulk close for 5.2.0.


            People

            • Assignee:
              David Smiley
              Reporter:
              Daniel Debray
            • Votes:
              2
              Watchers:
              10
