Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/highlighter
    • Labels:
      None

      Description

      Mark Harwood's highlighter package is a great contribution to Lucene, and I've used it a lot! However, with large documents (fields), highlighting can be quite time-consuming if you increase the number of bytes to analyze with setMaxDocBytesToAnalyze(int). The default value of 50k is often too low for indexed PDFs and the like, which results in empty highlight strings.

      This is an alternative approach that uses only term position vectors to build fragment info objects. A StringReader can then read the relevant fragments and skip() between them, which is a lot faster. This method also uses the entire field to find the best fragments, so you're always guaranteed to get a highlight snippet.
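As a rough illustration of the StringReader idea described above, the following sketch reads only a set of character ranges from a field's text and calls skip() to jump over everything else. It is not the attached FulltextHighlighter: FragmentReaderSketch, Range, and readFragments are hypothetical names, and the ranges are assumed to be sorted, non-overlapping character offsets of the kind a term vector with offsets could supply.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of the attached patch. */
final class FragmentReaderSketch {

    /** A character range [start, end) in the stored field text. */
    static final class Range {
        final int start;
        final int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Reads only the given ranges from the text and skips the text between them.
     * Assumes the ranges are sorted by start offset and do not overlap.
     */
    static List<String> readFragments(String text, List<Range> ranges) {
        List<String> fragments = new ArrayList<String>();
        StringReader reader = new StringReader(text);
        int position = 0; // current offset of the reader within the text
        try {
            for (Range r : ranges) {
                // Jump over the text between the previous fragment and this one.
                long toSkip = r.start - position;
                while (toSkip > 0) {
                    long skipped = reader.skip(toSkip);
                    if (skipped <= 0) {
                        break; // end of text reached
                    }
                    toSkip -= skipped;
                }
                // Read exactly the characters that belong to this fragment.
                char[] buf = new char[r.end - r.start];
                int read = 0;
                while (read < buf.length) {
                    int n = reader.read(buf, read, buf.length - read);
                    if (n < 0) {
                        break; // end of text reached
                    }
                    read += n;
                }
                fragments.add(new String(buf, 0, read));
                position = r.start + read;
            }
        } catch (IOException e) {
            // StringReader only throws after close(); rethrow unchecked to keep the sketch simple.
            throw new UncheckedIOException(e);
        }
        return fragments;
    }
}
```

The work done here is proportional to the number of fragments requested rather than to the number of tokens in the field, which is the property the description relies on.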

      Because this method only works with fields that have term positions stored, you can check whether it applies to a particular field with the following code (adapted from TokenSources.java):

      TermFreqVector tfv = reader.getTermFreqVector(docId, field);
      // instanceof is false for null, so no separate null check is needed
      if (tfv instanceof TermPositionVector) {
          // use FulltextHighlighter
      } else {
          // use standard Highlighter
      }

      Someone else might find this useful so I'm posting the code here.

      1. FulltextHighlighter.java
        13 kB
        Ronnie Kolehmainen
      2. FulltextHighlighterTest.java
        15 kB
        Ronnie Kolehmainen
      3. svn-diff.patch
        30 kB
        Ronnie Kolehmainen
      4. TokenSources.java
        17 kB
        Ronnie Kolehmainen
      5. TokenSources.java.diff
        12 kB
        Ronnie Kolehmainen
      6. FulltextHighlighter.java
        13 kB
        Ronnie Kolehmainen
      7. FulltextHighlighterTest.java
        16 kB
        Ronnie Kolehmainen
      8. svn-diff.patch
        31 kB
        Ronnie Kolehmainen

        Activity

        Ronnie Kolehmainen added a comment -

        Test case

        Ronnie Kolehmainen added a comment -

        Patch against current trunk

        Mark Harwood added a comment -

        Hi Ronnie,
        Thanks for the contribution, but I'm not sure I follow the justification for producing this code. Could it be because you assume the existing highlighter still requires an Analyzer to obtain a list of tokens? For a while now the highlighter has taken a TokenStream (as an alternative argument to Analyzer) in order to allow for faster sources of tokenized data.
        The TokenSources class from which you took some of the code was specifically introduced to offer the ability to quickly create TokenStreams from TermPositionVectors. However, I notice that the JUnit tests don't contain an example of using it; that should be added.

        Is there something else in your contribution/thinking here that I have missed?

        Cheers,
        Mark

        Ronnie Kolehmainen added a comment -

        Speed. Roughly 3-5 times faster in my tests (compared with creating a TokenStream with TokenSources). The code was produced because a list of 10 or 20 hits with highlighted snippets often took 3-4 seconds to render. That may not sound like much, but compared with times below a second it feels like a real improvement.

        Mark Harwood added a comment -

        I'm all for speed.

        I'm just not sure I see why/where the optimization comes in over using the existing Highlighter + TokenSources.getTokenStream.

        Can you post the client code that does the performance comparisons so I can see exactly what it is you are comparing?

        Thanks

        Ronnie Kolehmainen added a comment -

        Mark,

        I believe the performance gain is mostly because TokenSources.getTokenStream iterates all terms in the field, no?

        I've tested against a real index we have with theses and dissertations in full text. It's a 1.1 GB compound file. I think I will send you the code used for the test and the output instead of posting it here.

        Mark Harwood added a comment -

        Many thanks for the client code, Ronnie. I have tried it with my index and have reproduced the speed-up.
        I'm keen to integrate any code that offers a speed-up, ideally in such a way that we have one highlighter + JUnit test rig which can work with indexes with TermPositionVectors and also those without. This, I suspect, will involve merging bits of our code. There are a lot of test cases with strange analyzers that need to be considered, so that's why I'm keen to have one codebase.

        I'm disappearing on two weeks' holiday (vacation) shortly, so I haven't got a lot of time to look at this right now, but I plan to when I get back.

        After a quick look I haven't yet identified the difference between your approach and mine which offers the speed-up. One factor is likely that your code only considers offset positions of tokens that are actually query terms, and that may be something I could retrofit into TokenSources to produce TokenStreams containing only the tokens that matter to the highlighter.
        I suspect there are other benefits to be had from your code too, though, which I'll have to consider when I have more time.

        Thanks again for this,

        Cheers
        Mark

        Ronnie Kolehmainen added a comment -

        Many thanks for your feedback Mark.

        I have to admit, this was a rather quick fix for our needs, thus a separate class. Ideally, Highlighter and TokenSources should handle this internally and transparently, when applicable. Maybe some sort of HighlightedTokenStream (just off the top of my head) with query term tokens plus a few surrounding tokens. Should be doable.

        I'm heading for Seattle shortly, so I'm not sure I will have much time either, but if I do I will see whether the current package can be patched to benefit from the speed improvements available. Meanwhile this code can stay here as a reminder or source of inspiration, or in case someone finds a use for it.

        Cheers!
        Ronnie

        Ronnie Kolehmainen added a comment -

        Mark, I played around a bit with the code in TokenSources and added a method which takes an optional Query object. It will return a TokenSources.BestFragmentsTokenStream (which could later be identified by Highlighter if needed) when the query is not null and term positions are available. This token stream only holds the highlighted tokens and their surroundings.
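To make the "highlighted tokens and their surroundings" idea concrete, here is a minimal sketch over plain strings. It is not the BestFragmentsTokenStream from the attached TokenSources.java: SurroundingTokensSketch, keepWithContext, and the window parameter are hypothetical, and real tokens would also carry offsets and positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration; the attached BestFragmentsTokenStream may work differently. */
final class SurroundingTokensSketch {

    /**
     * Keeps every token that is a query term, plus up to `window` tokens
     * on either side of it. Token order is preserved.
     */
    static List<String> keepWithContext(List<String> tokens, Set<String> queryTerms, int window) {
        boolean[] keep = new boolean[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            if (queryTerms.contains(tokens.get(i))) {
                // Mark the hit and its neighbours, clamped to the token list bounds.
                int from = Math.max(0, i - window);
                int to = Math.min(tokens.size() - 1, i + window);
                for (int j = from; j <= to; j++) {
                    keep[j] = true;
                }
            }
        }
        List<String> kept = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            if (keep[i]) {
                kept.add(tokens.get(i));
            }
        }
        return kept;
    }
}
```

With the tokens a, b, c, fox, d, e and a window of 1, a query for "fox" would keep only c, fox, d.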

        These changes shave off about 50% of the time, but it is still two times slower than my first example. Also, the fragments don't look as expected. The terms are highlighted but the surrounding tokens are most often missing. I'm not sure why, as I haven't dug deeper into the Highlighter code. The tokens returned by TokenSources look fine though, with positions and in order...

        It would certainly be nice if most changes could be made in TokenSources. This would allow Highlighter to stay as flexible as it is today, with Scorers and Formatters.

        I won't have time to look at it anymore, at least for a while, but I'm posting my version of TokenSources and a diff against current trunk here in case you want to have a peek at it later.

        Ronnie Kolehmainen added a comment -

        Bugfix

        Grant Ingersoll added a comment -

        Is this still an issue? Does this speedup still apply?

        Mark Miller added a comment -

        Yes it is still an issue.

        It's been a while since I looked at this, so I may be a little off, but I believe the speed gain comes from the fact that this implementation only considers the terms from the query and, using info from TermVectors, reconstructs the document in large chunks (the chunks between each query term). So a 200 page document with one query term will be put together from the original doc after examining one token.

        The current Highlighter reconstructs by running over every term in the TokenStream. This doesn't scale well. A 200 page document will have every token analyzed and scored as the correct offsets from the original document are slowly built up.
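The contrast between the two approaches can be sketched with a toy term vector. This is a simplification under stated assumptions: QueryTermOffsetsSketch, buildVector, and offsetsFor are hypothetical names, the map stands in for a stored term vector with offsets, and whitespace splitting stands in for a real Analyzer. In Lucene the vector would be written at index time; buildVector exists here only so the sketch is self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of looking up offsets for query terms only. */
final class QueryTermOffsetsSketch {

    private static final Pattern TOKEN = Pattern.compile("\\S+");

    /** Stand-in for a stored term vector: lowercased term -> start offsets in the text. */
    static Map<String, List<Integer>> buildVector(String text) {
        Map<String, List<Integer>> vector = new HashMap<String, List<Integer>>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String term = m.group().toLowerCase();
            List<Integer> offsets = vector.get(term);
            if (offsets == null) {
                offsets = new ArrayList<Integer>();
                vector.put(term, offsets);
            }
            offsets.add(m.start());
        }
        return vector;
    }

    /**
     * Collects start offsets for the query terms only. The cost depends on the
     * number of query terms and their occurrences, not on the field's length.
     */
    static List<Integer> offsetsFor(Map<String, List<Integer>> vector, Set<String> queryTerms) {
        List<Integer> hits = new ArrayList<Integer>();
        for (String term : queryTerms) {
            List<Integer> offsets = vector.get(term);
            if (offsets != null) {
                hits.addAll(offsets);
            }
        }
        Collections.sort(hits);
        return hits;
    }
}
```

A token-stream based highlighter would instead visit every token in the field and score each one, which is where the per-document cost grows with document size.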

        The result is that Ronnie's highlighter is much faster with larger documents, but not with smaller ones (getting TermVector info is slow enough that you need large docs to benefit).

        I think Mark H could probably incorporate this into the other Highlighter, but it certainly won't fit the framework, so either you have to change the framework quite radically (affecting code out there I suppose) or have two frameworks that can be chosen from.

        The other disadvantage to this approach is that I don't see any way to incorporate position-sensitive highlighting.

        Mark Miller added a comment -

        I think it's time to close this issue. Further work here should probably be applied to the FastVectorHighlighter (which is very similar and now in contrib).

        Uwe Schindler added a comment -

        Closing since FastVectorHighlighter was added in Lucene 2.9.


          People

          • Assignee:
            Unassigned
            Reporter:
            Ronnie Kolehmainen
          • Votes:
            0
            Watchers:
            1
