Lucene - Core
  1. Lucene - Core
  2. LUCENE-1695

Update the Highlighter to use the new TokenStream API

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available
    1. LUCENE-1695.patch
      48 kB
      Mark Miller
    2. LUCENE-1695.patch
      51 kB
      Mark Miller
    3. LUCENE-1695.patch
      61 kB
      Mark Miller
    4. LUCENE-1695.patch
      56 kB
      Mark Miller
    5. LUCENE-1695.patch
      54 kB
      Mark Miller

      Issue Links

        Activity

        Hide
        Mark Miller added a comment -

        Sorry about that Koji - just updated it.

        Show
        Mark Miller added a comment - Sorry about that Koji - just updated it.
        Hide
        Koji Sekiguchi added a comment -

        Mark, can you remove System.out from TokenSources? I got a lot of "inc token" log messages when running highlight-profile.alg...

        Show
        Koji Sekiguchi added a comment - Mark, can you remove System.out from TokenSources? I got a lot of "inc token" log messages when running highlight-profile.alg...
        Hide
        Mark Miller added a comment -

        I've committed this. We can reopen if someone brings up a new argument. Putting it in will help me finish up making the SpanScorer the default Scorer.

        Show
        Mark Miller added a comment - I've committed this. We can reopen if someone brings up a new argument. Putting it in will help me finish up making the SpanScorer the default Scorer.
        Hide
        Mark Miller added a comment -

        Alright - this is no idle threat. I'm gonna commit this. And break everyone's custom Highlighter plugin classes everywhere. Smash them. Obliterate them. Make them rewrite them. Or use the old Highlighter jar. Any moment now ...

        Show
        Mark Miller added a comment - Alright - this is no idle threat. I'm gonna commit this. And break everyone's custom Highlighter plugin classes everywhere. Smash them. Obliterate them. Make them rewrite them. Or use the old Highlighter jar. Any moment now ...
        Hide
        Mark Miller added a comment -

        To trunk

        Show
        Mark Miller added a comment - To trunk
        Hide
        Mark Miller added a comment -

        So without further objection, I'm going to commit this so that I can finish the 'make spanscorer the default' issue.

        Show
        Mark Miller added a comment - So without further objection, I'm going to commit this so that I can finish the 'make spanscorer the default' issue.
        Hide
        Uwe Schindler added a comment -

        not to hijack this issue (sorry) but Uwe, this reminds me that there are things that extend SinkTokenizer in contrib (analysis/sinks) as well, so it would be great to have a migration plan for those also

        Extend!? Very bad, all TokenStreams should be final should be but may not. I think these classes should stay as they are and use Tee/Sink and we should deprecate them. But TeeSinkTokenStream is final (and this is important) and the corresponding Sinks are somehow abstract (without "known" implementation).

        Show
        Uwe Schindler added a comment - not to hijack this issue (sorry) but Uwe, this reminds me that there are things that extend SinkTokenizer in contrib (analysis/sinks) as well, so it would be great to have a migration plan for those also Extend!? Very bad, all TokenStreams should be final – should be but may not. I think these classes should stay as they are and use Tee/Sink and we should deprecate them. But TeeSinkTokenStream is final (and this is important) and the corresponding Sinks are somehow abstract (without "known" implementation).
        Hide
        Uwe Schindler added a comment -

        I think you could probably continue using the 2.4 Highlighter jar as well? All of the classes should be intact and compatible I think.

        Should be (from the TokenStream changes perspective). Are there any non-bw changes in query processing that may affect highlighter?

        Show
        Uwe Schindler added a comment - I think you could probably continue using the 2.4 Highlighter jar as well? All of the classes should be intact and compatible I think. Should be (from the TokenStream changes perspective). Are there any non-bw changes in query processing that may affect highlighter?
        Hide
        Mark Miller added a comment -

        Here we would need to deprecate the whole Highlighter or add some pretty nasty code to the current one to support both. Because the Highlighter has no back compat promise, I'd almost rather just jump to the new API. I think there are a couple other breaks that should be made (to make using the SpanScorer more rational when we make it the default), so doing everything at once with 2.9 seems somewhat reasonable to me. Updating code to work with the new API should be as straightforward as updating a TokenFilter/TokenStream. I think you could probably continue using the 2.4 Highlighter jar as well? All of the classes should be intact and compatible I think.

        Show
        Mark Miller added a comment - Here we would need to deprecate the whole Highlighter or add some pretty nasty code to the current one to support both. Because the Highlighter has no back compat promise, I'd almost rather just jump to the new API. I think there are a couple other breaks that should be made (to make using the SpanScorer more rational when we make it the default), so doing everything at once with 2.9 seems somewhat reasonable to me. Updating code to work with the new API should be as straightforward as updating a TokenFilter/TokenStream. I think you could probably continue using the 2.4 Highlighter jar as well? All of the classes should be intact and compatible I think.
        Hide
        Robert Muir added a comment -

        not to hijack this issue (sorry) but Uwe, this reminds me that there are things that extend SinkTokenizer in contrib (analysis/sinks) as well, so it would be great to have a migration plan for those also.

        Show
        Robert Muir added a comment - not to hijack this issue (sorry) but Uwe, this reminds me that there are things that extend SinkTokenizer in contrib (analysis/sinks) as well, so it would be great to have a migration plan for those also.
        Hide
        Uwe Schindler added a comment -

        We had the same problem in core with Tee/sinkTokenizer. Both classes exported API using Token instances without any real use (I think it was because of tests). We deprecated the whole calss and created a new TeeSinkTokenFilter using States and do not export thier internal implementation (which is not needed for Tee/Sink usage).

        Show
        Uwe Schindler added a comment - We had the same problem in core with Tee/sinkTokenizer. Both classes exported API using Token instances without any real use (I think it was because of tests). We deprecated the whole calss and created a new TeeSinkTokenFilter using States and do not export thier internal implementation (which is not needed for Tee/Sink usage).
        Hide
        Robert Muir added a comment -

        Mark there is a related issue in some of the other contribs.

        ShingleMatrix and Compound expose a token api in a similar way... I'm kinda not sure what to do with these.

        Show
        Robert Muir added a comment - Mark there is a related issue in some of the other contribs. ShingleMatrix and Compound expose a token api in a similar way... I'm kinda not sure what to do with these.
        Hide
        Mark Miller added a comment -

        Actually, I guess I would rather still change the API's - otherwise there will have to be a lot of needless Token object creation. Anyone else have an opinion? If not, I'm going to completely break back compat with the Highlighter here.

        Show
        Mark Miller added a comment - Actually, I guess I would rather still change the API's - otherwise there will have to be a lot of needless Token object creation. Anyone else have an opinion? If not, I'm going to completely break back compat with the Highlighter here.
        Hide
        Mark Miller added a comment -

        Turns out, if Token is not deprecated, we don't really have to change those Highlighter plugin API's after all. I'll revert them.

        Show
        Mark Miller added a comment - Turns out, if Token is not deprecated, we don't really have to change those Highlighter plugin API's after all. I'll revert them.
        Hide
        Mark Miller added a comment -

        to trunk

        Show
        Mark Miller added a comment - to trunk
        Hide
        Mark Miller added a comment -

        Pretty much done, all tests pass. It breaks back compat, but frankly, straddling doesn't seem worth the effort here. Or even very possible. You can't really give new methods to use for the deprecated ones, and deprecating by class would be a real nuisance as we would lose class names I'd rather keep. We have no back compat policy, and I think its worth just pushing this to the new API.

        I was also thinking about breaking back compat with changing the Highlighter to use the SpanScorer, so doing it all in one shot would be nice. The overall migration should be fairly simple once you understand the new TokenFilter API. I'll handle it for Solr.

        Still needs either its own changes file to explain or could go in the contrib common changes file.

        There is a change to the MemoryIndex to get around issues with the new/old API and CachingTokenFilters.

        Ill have to see how the new TokenFilter API improvements issue works out before doing a final patch for this.

        Show
        Mark Miller added a comment - Pretty much done, all tests pass. It breaks back compat, but frankly, straddling doesn't seem worth the effort here. Or even very possible. You can't really give new methods to use for the deprecated ones, and deprecating by class would be a real nuisance as we would lose class names I'd rather keep. We have no back compat policy, and I think its worth just pushing this to the new API. I was also thinking about breaking back compat with changing the Highlighter to use the SpanScorer, so doing it all in one shot would be nice. The overall migration should be fairly simple once you understand the new TokenFilter API. I'll handle it for Solr. Still needs either its own changes file to explain or could go in the contrib common changes file. There is a change to the MemoryIndex to get around issues with the new/old API and CachingTokenFilters. Ill have to see how the new TokenFilter API improvements issue works out before doing a final patch for this.
        Hide
        Mark Miller added a comment -

        Rough, non backward compat patch.

        There is still an issue with testUnRewrittenQuery() - it passes in isolation, but not when run with the other tests:

        java.io.IOException: Stream closed
        at java.io.StringReader.ensureOpen(StringReader.java:56)
        at java.io.StringReader.read(StringReader.java:90)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:451)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:637)
        at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:153)
        at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:50)
        at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:38)
        at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:222)
        at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:242)

        Show
        Mark Miller added a comment - Rough, non backward compat patch. There is still an issue with testUnRewrittenQuery() - it passes in isolation, but not when run with the other tests: java.io.IOException: Stream closed at java.io.StringReader.ensureOpen(StringReader.java:56) at java.io.StringReader.read(StringReader.java:90) at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:451) at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:637) at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:153) at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:50) at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:38) at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:222) at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:242)

          People

          • Assignee:
            Mark Miller
            Reporter:
            Mark Miller
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development