Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2.2 (incubating)
    • Fix Version/s: 0.4.0
    • Component/s: Core
    • Labels: None
    • Environment: Linux

Description

      If I explicitly specify excerpt_length:

      my $hl = Lucy::Highlight::Highlighter->new(
          searcher       => $searcher,
          query          => $query_compiler,
          field          => 'site',
          excerpt_length => 60,
      );

      ...and the field content is longer than 60 characters, then

      $page_highlighter->create_excerpt($hit);

      returns only '...'.

      Content shorter than 60 characters returns the highlighted excerpt as expected.

      If I comment out "excerpt_length => 60," above, then it returns the full
      non-truncated excerpt with highlighting as expected.

      Some >60-character samples which return "…"/"...", searching for [iol.co.za] or
      [news24.com] (brackets are mine):

      [www.iol.co.za/tonight/books/what-the-dickens-gets-a-statue-1.1130220]
      http://www.news24.com/News24v2/Travel/Mini_Site/ContentDisplay/n24TravelMiniSiteHome/0,,,00.html
      [www.news24.com/News24v2/Travel/Mini_Site/ContentDisplay/n24TravelMiniSite_TravelClub/0,,,00.html]

      The following return double ellipses ("……"), searching for [adsl mweb.com]:

      http://www.mweb.co.za/helpcentre/ADSL/ADSLGeneralIdisagreewithyourusagereport.aspx
      http://www.mweb.co.za/helpcentre/FrequentlyAskedQuestions/MWEBHelpCentreFAQsHowdoI/FAQHowdoIHowdoImigratemyADSL/tabid/661/Default.aspx

Attachments

    1. LUCY-199-quickfix.patch (4 kB, Nick Wellnhofer)
    2. hltest1.tgz (2 kB, Henry)

Activity

        Nick Wellnhofer added a comment -

        After rewriting the Highlighter code to use string iterators, the test case produces:

        $ ./search statue
        … ckens-gets-a-<strong>statue</strong>-1.1130220
        $ ./search dickens
        … oks/what-the-<strong>dickens</strong>-gets-a-statue-1.

        This could be improved by breaking at non-alphabetic characters, but I think it's good enough to resolve this issue.

        (Note that the test case file 'runme' has to be changed to use a FullTextType with 'highlightable => 1'.)
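
        As a side note, the "breaking at non-alphabetic characters" idea could look roughly like this in Perl (a sketch only; the helper name is invented, and the real code lives in Lucy's C core):

        # Scan forward from $target for the first non-alphanumeric
        # character and break there; fall back to $target itself if
        # the rest of the string is one unbroken word.
        sub find_break {
            my ($text, $target) = @_;
            for my $i ($target .. length($text) - 1) {
                return $i if substr($text, $i, 1) =~ /[^[:alnum:]]/;
            }
            return $target;
        }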

        Nick Wellnhofer added a comment -

        Thinking more about a better fix for this problem, it's important to note that choosing a good excerpt is an operation that can be performed without knowledge of the actual tokenization algorithm used during indexing. I think it's enough to

        • find boundaries that are more or less correct in a semantic and visual sense, and
        • be tolerant enough to find boundaries in long whitespace-free substrings that might exceed excerpt_length (given that whitespace is the obvious place to break words, as in the current implementation).

        If the highlighter finds additional word breaks, that shouldn't be a problem as long as the result is visually correct.

        Such an approach wouldn't depend on the analyzer at all, and it wouldn't introduce additional coupling between Lucy's components. Of course, it would mean implementing a separate Unicode-capable word-breaking algorithm for the highlighter. But that shouldn't be very hard, as we could reuse parts of the StandardTokenizer.
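
        As a rough Perl sketch of that tolerance requirement (illustrative only; the function name and scan limit are invented, not Lucy API):

        # Scan backward from $limit for whitespace; if none appears
        # within $max_scan characters, accept any non-word character;
        # failing that, break at $limit itself rather than discarding
        # the whole substring.
        sub tolerant_boundary {
            my ($text, $limit, $max_scan) = @_;
            for my $i (reverse $limit - $max_scan .. $limit) {
                next if $i < 0;
                return $i if substr($text, $i, 1) =~ /\s/;
            }
            for my $i (reverse $limit - $max_scan .. $limit) {
                next if $i < 0;
                return $i if substr($text, $i, 1) =~ /\W/;
            }
            return $limit;
        }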

        Marvin Humphrey added a comment -

        +1 to commit to trunk.

        +1 to merge to 0.3.

        The patch looks right to me, and all tests pass, both on trunk
        and the 0.3 branch, including while running under Valgrind.

        Nick Wellnhofer added a comment -

        Here is the simplest partial fix I could come up with. I'm not really happy with it because it can produce excerpts that don't contain the search term. For example, if you set excerpt_length to 40 in Henry's test case, you get the following result:

        $ ./search statue
        www.iol.co.za/tonight/books/what-the-di…

        The full URL is www.iol.co.za/tonight/books/what-the-dickens-gets-a-statue-1.1130220 and it gets truncated before the term "statue". But the patch is certainly an improvement.

        Marvin Humphrey added a comment -

        > We recompute the word boundaries during highlighting like we do now. We
        > could use the analyzer of the current schema for that. But we could also use
        > any other kind algorithm that is better than the current one. This might be
        > very cheap since we only work on a small subset of the document.

        That sounds like a really good idea for a lot of reasons! I don't quite
        understand how it will solve this bug, but that's partly because the boundary
        detection code in Highlighter is complex and messy – and using an Analyzer
        would help to clean it up.

        One thing to bear in mind is that Highlighter is not only concerned with word
        boundaries, but sentence boundaries. Take a look at the excerpts on the SERPs
        for Google or any other major web search engine – they tend to prefer
        complete sentences. Lucy's own highlighter favors sentences just because I
        had a gut feeling that it was superior to the random word boundaries chosen by
        the Lucene highlighter, but I'm sure there are academic papers by now which
        explain why it's desirable.

        I note that UAX #29 describes an algorithm for sentence boundary detection.
        Our StandardTokenizer implements UAX #29 word boundary tokenization; we could
        implement a new Analyzer for sentence boundary detection using the same
        techniques. (Lucy::Analysis::StandardSentenceTokenizer?) Then we could
        leverage Lucy's analysis apparatus for both boundary detection phases within
        Highlighter, while still utilizing the existing highlighting data generated at
        index-time for generating heat maps and scoring excerpt candidates. That
        would get a lot of ugly code out of Highlighter and make it much easier to
        work on.
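
        (A naive Perl stand-in for sentence splitting, to give the flavor; this is emphatically not UAX #29, which also handles abbreviations, digits, and much more:

        # Break after sentence-final punctuation followed by whitespace.
        my @sentences = split /(?<=[.!?])\s+/, $text;

        A real StandardSentenceTokenizer would implement the full rule set.)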

        If we want a quick fix for this bug, though, I think we could also just wrap
        an "if" test around the code which deals with the closing ellipsis, and if we
        eat the whole string looking for a boundary, fall back to swapping out the
        last character for an ellipsis.
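
        In Perl-flavored pseudocode, that fallback might look like this (a sketch only; the actual code is C, and $boundary/$excerpt are stand-in variables):

        # $boundary is where the backward scan for a word boundary stopped.
        # If it consumed the entire excerpt, keep the text and just
        # replace the final character with an ellipsis.
        if ($boundary > 0) {
            substr($excerpt, $boundary) = '…';
        }
        else {
            substr($excerpt, -1, 1) = '…';
        }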

        Nick Wellnhofer added a comment -

        Thanks for the excellent test case.

        The whole thing is a bug in Highlighter_raw_excerpt. When prepending or appending an ellipsis, the code tries to make sure this happens on a word boundary, so it chops off words to make room for the ellipsis. Unfortunately, it simply looks for whitespace to determine a word boundary. In the case of URLs, it doesn't find any whitespace and deletes the whole URL from raw_excerpt.

        I see two approaches to fix this:

        • Since the word boundaries are already computed during analysis, we could try to reuse that data. AFAICS this would mean looping through all the terms of the document, then extracting and finally sorting all start and end offsets. I'm not sure how expensive that would be.
        • We recompute the word boundaries during highlighting, as we do now. We could use the analyzer of the current schema for that, but we could also use any other kind of algorithm that is better than the current one. This might be very cheap, since we only work on a small subset of the document (sketched below).
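
        A minimal sketch of the second approach (assuming Lucy's Perl API, where an Analyzer's transform_text() returns an inversion of tokens carrying start/end offsets; treat this as illustrative, not the eventual implementation):

        # Re-tokenize only the excerpt window and collect sorted
        # boundary offsets to break on.
        my $inversion = $analyzer->transform_text($window);
        my @boundaries;
        while (my $token = $inversion->next) {
            push @boundaries, $token->get_start_offset, $token->get_end_offset;
        }
        @boundaries = sort { $a <=> $b } @boundaries;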
        Henry added a comment - edited

        I have attached a test case for this issue as requested by Nick.

        To use:
        mkdir test
        cd test
        tar zxf hltest1.tgz

        ./runme # to create the test index

        To show the incorrect highlighting (i.e., no highlighting at all) when using a 60-character excerpt length (two hits):
        % ./search mweb
        …
        ……

        Another:
        ./search adsl
        …

        More:
        ./search news24.com
        …
        …

        Finally:
        ./search iol.co.za
        …

        If you comment out the length in ./search:

        - excerpt_length => 60,
        + #excerpt_length => 60,

        ...and rerun the ./search tests, you'll get the expected highlighting, but obviously with an undesirable length.

        Henry added a comment -

        test case for highlighter issue


People

    • Assignee: Unassigned
    • Reporter: Henry
    • Votes: 0
    • Watchers: 1
