Lucene - Core
  1. Lucene - Core
  2. LUCENE-6445

Highlighter TokenSources simplification; just one getAnyTokenStream()

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The Highlighter "TokenSources" class has quite a few utility methods pertaining to getting a TokenStream from either term vectors or analyzed text. I think it's too much:

      • some go to term vectors, some don't. But if you don't want to go to term vectors, then it's quite easy for the caller to invoke the Analyzer for the field value, and to get that field value.
      • Some methods return null, some never null; I forget which at a glance.
      • Some methods read the Document (to get a field value) from the IndexReader, some don't. Furthermore, it's not an ideal place to get the doc since your app might be using an IndexSearcher with a document cache (e.g. SolrIndexSearcher).
      • None of the methods accept a Fields instance from term vectors as a parameter. Based on how Lucene's term vector format works, this is a performance trap if you don't re-use an instance across fields on the document that you're highlighting.

        Issue Links

          Activity

          Hide
          David Smiley added a comment -

          What I propose is two methods:

          getTokenStream(String field, Fields tvFields, String text, Analyzer analyzer, int maxStartOffset) throws IOException
          

          and

          getTermVectorTokenStreamIfPresent(String field,  Fields tvFields, maxStartOffset) throws IOException
          

          All the others can be deprecated in 5x, removed in trunk. If you supply a maxStartOffset, it should apply to either term vectors or analyzed text – whichever it gets it from. See LUCENE-6423 LimitTokenOffsetFilter. If the term vector doesn't have offsets then it won't be used. Ditto for positions... and if the caller knows what it's doing and wants to use TokenStreamFromTermVector with offsets but not positions then it can do so, but not with these convenience methods. The second method can return null; it's name is suggestive of that. IOException is never masked in a RuntimeException.

          Show
          David Smiley added a comment - What I propose is two methods: getTokenStream( String field, Fields tvFields, String text, Analyzer analyzer, int maxStartOffset) throws IOException and getTermVectorTokenStreamIfPresent( String field, Fields tvFields, maxStartOffset) throws IOException All the others can be deprecated in 5x, removed in trunk. If you supply a maxStartOffset, it should apply to either term vectors or analyzed text – whichever it gets it from. See LUCENE-6423 LimitTokenOffsetFilter. If the term vector doesn't have offsets then it won't be used. Ditto for positions... and if the caller knows what it's doing and wants to use TokenStreamFromTermVector with offsets but not positions then it can do so, but not with these convenience methods. The second method can return null; it's name is suggestive of that. IOException is never masked in a RuntimeException.
          Hide
          ASF subversion and git services added a comment -

          Commit 1675504 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1675504 ]

          LUCENE-6392 Highlighter: add maxStartOffset (and other memory saving efficiencies) to TokenStreamFromTermVector.
          Delaying TokenSources changes for LUCENE-6445.

          Show
          ASF subversion and git services added a comment - Commit 1675504 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1675504 ] LUCENE-6392 Highlighter: add maxStartOffset (and other memory saving efficiencies) to TokenStreamFromTermVector. Delaying TokenSources changes for LUCENE-6445 .
          Hide
          ASF subversion and git services added a comment -

          Commit 1675505 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1675505 ]

          LUCENE-6392 Highlighter: add maxStartOffset (and other memory saving efficiencies) to TokenStreamFromTermVector.
          Delaying TokenSources changes for LUCENE-6445.

          Show
          ASF subversion and git services added a comment - Commit 1675505 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1675505 ] LUCENE-6392 Highlighter: add maxStartOffset (and other memory saving efficiencies) to TokenStreamFromTermVector. Delaying TokenSources changes for LUCENE-6445 .
          Hide
          David Smiley added a comment -

          Attached patch.
          The 2nd method name is actually "getTermVectorTokenStreamOrNull", and I decided that positions on the term vector needn't be a hard requirement.

          The patch adds a test for the maxStartOffset behavior. The javadocs for these two methods are quite complete, including a warning about multi-valued fields. Solr calls one of these now with the maxStartOffset, so it will benefit. Updating all the test calls was a bit tedious.

          Also, this highlighter module now depends on analysis-common for the LimitTokenOffsetFilter.

          Show
          David Smiley added a comment - Attached patch. The 2nd method name is actually "getTermVectorTokenStreamOrNull", and I decided that positions on the term vector needn't be a hard requirement. The patch adds a test for the maxStartOffset behavior. The javadocs for these two methods are quite complete, including a warning about multi-valued fields. Solr calls one of these now with the maxStartOffset, so it will benefit. Updating all the test calls was a bit tedious. Also, this highlighter module now depends on analysis-common for the LimitTokenOffsetFilter.
          Hide
          ASF subversion and git services added a comment -

          Commit 1676540 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1676540 ]

          LUCENE-6445: Highlighter TokenSources simplification

          Show
          ASF subversion and git services added a comment - Commit 1676540 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1676540 ] LUCENE-6445 : Highlighter TokenSources simplification
          Hide
          ASF subversion and git services added a comment -

          Commit 1676571 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1676571 ]

          LUCENE-6445: Highlighter TokenSources simplification

          Show
          ASF subversion and git services added a comment - Commit 1676571 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1676571 ] LUCENE-6445 : Highlighter TokenSources simplification
          Hide
          Anshum Gupta added a comment -

          Bulk close for 5.2.0.

          Show
          Anshum Gupta added a comment - Bulk close for 5.2.0.

            People

            • Assignee:
              David Smiley
              Reporter:
              David Smiley
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development