Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-6375

Inconsistent interpretation of maxDocCharsToAnalyze in Highlighter & WeightedSpanTermExtractor

Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • None
    • None
    • None
    • None
    • New

    Description

      Way back in LUCENE-2939, the default/standard Highlighter's WeightedSpanTermExtractor (referenced by QueryScorer, used by Highlighter.java) got a performance feature maxDocCharsToAnalyze to set a limit on how much text to process when looking for phrase queries and wildcards (and some other advanced query types). Highlighter itself also has a limit by the same name. They are not interpreted the same way!

      Highlighter loops over tokens and halts early if the token's start offset >= maxDocCharsToAnalyze. In this light, it's almost as if the input string was truncated to be this length, but a bit beyond to the next tokenization boundary. The PostingsHighlighter also has a configurable limit it calls "maxLength" (or contentLength) that is conceptually similar but implemented differently because it doesn't tokenize; but it does have the inverted start & end offsets to check if it's reached the end with respect to this configured limit. FYI Solr's hl.maxAnalyzedChars is supplied as a configured input to both highlighters in this manner; the FastVectorHighlighter doesn't have a limit.

      Highlighter propagates it's configured maxAnalyzedChars to QueryScorer which in turn propagates it to WeightedSpanTermExtractor. WSTE doesn't interpret this the same way as Highlighter or PostingsHighlighter. It uses an OffsetLimitTokenFilter which accumulates the deltas in start & end offsets of each token it sees. That is:

            int offsetLength = offsetAttrib.endOffset() - offsetAttrib.startOffset();
            offsetCount += offsetLength;
      

      So if you've got analysis which produces a lot of posInc-0 tokens (as I do), you will likely hit this limit earlier than when Highlighter will. Or if you have very few tokens with tons of whitespace then WSTE will index terms that will never be highlighted. This isn't a big deal but it should be fixed. This filter should simply examine if the startOffset is >= a configured limit and return false from it's incrementToken if so.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              dsmiley David Smiley
              Votes:
              1 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated: