Lucene - Core
  LUCENE-6392

Add offset limit to Highlighter's TokenStreamFromTermVector

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The Highlighter's TokenStreamFromTermVector utility, typically accessed via TokenSources, should have the ability to filter out tokens beyond a configured offset. There is a TODO there already, and this issue addresses it. New methods in TokenSources now propagate a limit.

      This patch also includes some memory saving optimizations, to be described shortly.
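
      A minimal usage sketch of the intended API follows. It assumes the post-patch TokenStreamFromTermVector constructor takes the term vector plus a maxStartOffset (with -1 meaning no limit); the helper class and method names below are illustrative only, not part of the patch.

        import java.io.IOException;

        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.Terms;
        import org.apache.lucene.search.highlight.TokenStreamFromTermVector;

        public class LimitedTermVectorTokenSource {
          /**
           * Returns a TokenStream over the document's term vector for the given field,
           * restricted to tokens whose start offset is at or below maxStartOffset
           * (-1 means no limit), or null if no term vector with offsets is available.
           */
          public static TokenStream tokenStreamOrNull(IndexReader reader, int docId,
                                                      String field, int maxStartOffset)
              throws IOException {
            Terms vector = reader.getTermVector(docId, field);
            if (vector == null || !vector.hasOffsets()) {
              return null; // caller would fall back to re-analyzing stored text
            }
            // Assumed post-patch constructor: (Terms vector, int maxStartOffset).
            return new TokenStreamFromTermVector(vector, maxStartOffset);
          }
        }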

          Activity

          David Smiley added a comment -

          (Patch attached).
          Elaborating on the description:

          This patch includes a tweak to the TokenLL[] array size initialization to consider this new limit when guessing a good size.

          This patch includes memory saving optimizations to the information it accumulates. Before the patch, each TokenLL had a char[], so there were a total of 2 objects per token (including the token itself). Now I use a shared CharsRefBuilder with a pointer & length into it, so there's just 1 object now, plus byte savings by avoiding a char[] header. I also reduced the bytes needed for a TokenLL instance from 40 to 32. It does assume that the char offset delta (endOffset - startOffset) can fit within a short, which seems like a reasonable assumption to me. For safety I guard against overflow and substitute Short.MAX_VALUE.
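
          The sketch below illustrates those two ideas in isolation; it is not the patch code, and the class, field, and method names are invented for the example. All token text is appended to one shared CharsRefBuilder, each token keeps only an offset and length into that buffer, and the offset span is clamped to Short.MAX_VALUE on overflow.

            import org.apache.lucene.util.CharsRefBuilder;

            class TokenSketch {
              int termCharsOff;    // start of this token's text in the shared buffer
              short termCharsLen;  // length of this token's text (assumed to fit in a short here)
              int startOffset;     // character start offset in the source text
              short offsetSpan;    // endOffset - startOffset, clamped to Short.MAX_VALUE

              static TokenSketch record(CharsRefBuilder sharedChars, char[] termChars, int termLen,
                                        int startOffset, int endOffset) {
                TokenSketch t = new TokenSketch();
                t.termCharsOff = sharedChars.length();
                t.termCharsLen = (short) termLen;
                sharedChars.append(termChars, 0, termLen); // one shared buffer, no per-token char[]
                t.startOffset = startOffset;
                int span = endOffset - startOffset;
                t.offsetSpan = span > Short.MAX_VALUE ? Short.MAX_VALUE : (short) span;
                return t;
              }

              int endOffset() {
                return startOffset + offsetSpan;
              }
            }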

          Finally, to encourage users to supply a limit (even if "-1" to mean no limit), I decided to deprecate many of the methods in TokenSources in favor of new ones that take a limit parameter. But for the methods that fall back to a provided Analyzer, I wonder whether it makes sense for them to apply the limit to the analyzer-produced token stream too. I think it does: if you want to limit the tokens, it shouldn't matter where they came from. I haven't added that yet; I'm looking for feedback first.
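
          A hypothetical filter along those lines is sketched below; it is not part of the patch and the class name is invented. It ends the analyzer-produced stream at the first token whose start offset passes the limit, so the analyzer path and the term-vector path would honor the same limit.

            import java.io.IOException;

            import org.apache.lucene.analysis.TokenFilter;
            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

            final class MaxStartOffsetFilter extends TokenFilter {
              private final int maxStartOffset; // -1 means no limit
              private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

              MaxStartOffsetFilter(TokenStream in, int maxStartOffset) {
                super(in);
                this.maxStartOffset = maxStartOffset;
              }

              @Override
              public boolean incrementToken() throws IOException {
                if (!input.incrementToken()) {
                  return false;
                }
                // Stop at the first token starting beyond the configured limit.
                return maxStartOffset < 0 || offsetAtt.startOffset() <= maxStartOffset;
              }
            }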

          David Smiley added a comment -

          Oh and before I forget, obviously Solr's DefaultSolrHighlighter should propagate the offset. It's not in this patch to avoid interference with my other WIP.
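
          A hedged sketch of that follow-up: DefaultSolrHighlighter already reads hl.maxAnalyzedChars (HighlightParams.MAX_CHARS), and it could hand that value down as the term-vector offset limit. The helper class below is illustrative only, not part of this patch, and the exact wiring is an assumption.

            import java.io.IOException;

            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.index.Terms;
            import org.apache.lucene.search.highlight.Highlighter;
            import org.apache.lucene.search.highlight.TokenStreamFromTermVector;
            import org.apache.solr.common.params.HighlightParams;
            import org.apache.solr.common.params.SolrParams;

            class TermVectorStreamHelper {
              /** Builds a term-vector token stream capped by hl.maxAnalyzedChars for the field. */
              static TokenStream fromTermVector(Terms tvTerms, SolrParams params, String fieldName)
                  throws IOException {
                int maxChars = params.getFieldInt(fieldName, HighlightParams.MAX_CHARS,
                    Highlighter.DEFAULT_MAX_CHARS_TO_ANALYZE);
                // Offsets are zero-based, so the last analyzable start offset is maxChars - 1.
                return new TokenStreamFromTermVector(tvTerms, maxChars - 1);
              }
            }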

          David Smiley added a comment -

          I created LUCENE-6445 for a broader refactor/simplification of TokenSources.java. I don't think this issue should include modifications to that class; it can be limited to TokenStreamFromTermVector. I plan to commit this issue in ~24 hours without any TokenSources modifications.

          ASF subversion and git services added a comment -

          Commit 1675504 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1675504 ]

          LUCENE-6392 Highlighter: add maxStartOffset (and other memory saving efficiencies) to TokenStreamFromTermVector.
          Delaying TokenSources changes for LUCENE-6445.

          ASF subversion and git services added a comment -

          Commit 1675505 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1675505 ]

          LUCENE-6392 Highlighter: add maxStartOffset (and other memory saving efficiencies) to TokenStreamFromTermVector.
          Delaying TokenSources changes for LUCENE-6445.

          Anshum Gupta added a comment -

          Bulk close for 5.2.0.


            People

            • Assignee:
              David Smiley
            • Reporter:
              David Smiley
            • Votes:
              0
            • Watchers:
              3
