Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7717

UnifiedHighlighter doesn't highlight PrefixQuery with multi-byte chars

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.1, 6.3, 6.4.1
    • Fix Version/s: 6.4.2
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      UnifiedHighlighter highlighter = new UnifiedHighlighter(null, new StandardAnalyzer());
      Query query = new PrefixQuery(new Term("title", "я"));
      String testData = "я";
      Object snippet = highlighter.highlightWithoutSearcher(fieldName, query, testData, 1);
      System.out.printf("testData=[%s] Query=%s snippet=[%s]\n", testData, query, snippet==null?null:snippet.toString());

      1. LUCENE-7717.patch
        6 kB
        David Smiley
      2. LUCENE-7717.patch
        2 kB
        Christine Poerschke

        Issue Links

          Activity

          Hide
          dmitrymalinin Dmitry Malinin added a comment -

          I think that in org.apache.lucene.search.uhighlight.MultiTermHighlighting.extractAutomata()
          condition "if (query instanceof AutomatonQuery)" should be the last in "if" chain

          Show
          dmitrymalinin Dmitry Malinin added a comment - I think that in org.apache.lucene.search.uhighlight.MultiTermHighlighting.extractAutomata() condition "if (query instanceof AutomatonQuery)" should be the last in "if" chain
          Hide
          cpoerschke Christine Poerschke added a comment -

          Hello Dmitry. - I am attaching potential test case adapted from your code snippet (no pun intended) in the description. The test passes locally for me though. Could you perhaps try running it locally too and adapt/adjust it and with/without the MultiTermHighlighting change you mention? Thanks. - Christine

          Show
          cpoerschke Christine Poerschke added a comment - Hello Dmitry. - I am attaching potential test case adapted from your code snippet (no pun intended) in the description. The test passes locally for me though. Could you perhaps try running it locally too and adapt/adjust it and with/without the MultiTermHighlighting change you mention? Thanks. - Christine
          Hide
          dsmiley David Smiley added a comment -

          At some point after MultiTermHighlighting.java was first written, PrefixQuery was altered to be a subclass of AutomatonQuery. So PrefixQuery detection could simply be removed now, I think, since it's handled via AutomatonQuery condition.

          I'm working on debugging to see why this fails & a proper test. (the test would go in TestUnifiedHighlighterMTQ by the way)

          Show
          dsmiley David Smiley added a comment - At some point after MultiTermHighlighting.java was first written, PrefixQuery was altered to be a subclass of AutomatonQuery. So PrefixQuery detection could simply be removed now, I think, since it's handled via AutomatonQuery condition. I'm working on debugging to see why this fails & a proper test. (the test would go in TestUnifiedHighlighterMTQ by the way)
          Hide
          dsmiley David Smiley added a comment -

          Here's my take on it: The UnifiedHighlighter (and PostingsHighlighter from which it derives) processes the MultiTermQueries (e.g. wildcards) in the query and creates multiple CharacterRunAutomaton intended to match the same things. CharacterRunAutomaton takes a Automaton as input, and when it does it's processing, it matches the Character code points (integers from 0 to 0x10FFFF) against the integers in the Automaton. However, this strategy assumes that the Automaton was constructed based on character code points. But AutomatonQuery.getAutomaton is intended to match byte by byte (integers 0 to 255). PrefixQuery.toAutomaton will get 2 bytes for the the "я" in BytesRef form, and add 2 states. This does not line up with the assumptions of CharacterRunAutomaton.

          A short term immediate "fix" is simply to put AutomatonQuery last in the if-else list as Dmitry indicated. As such, PrefixQuery will work again. This was broken by LUCENE-6367 (Lucene 5.1). TermRangeQuery, which also now extends AutomatonQuery, will likewise work – broken by LUCENE-5879 (Lucene 5.2). Again, back when MultiTermHighlighting was first written, neither of those queries extended AutomatonQuery. But there will be bugs for other types of AutomatonQuery (namely WildcardQuery and RegexpQuery) that have yet to be reported.

          Robert Muir or Michael McCandless I wonder if you have any thoughts on how to fix this. An idea I have is to not use a CharacterRunAutomaton in the UnifiedHighlighter; use a ByteRunAutomaton instead. Then, add ByteRunAutomaton.run(char[] ...etc) that converts each character to the equivalent UTF8 bytes to match. Even with that, I wonder if this points to areas to improve the automata API so that people don't bump into this trap in the future. For example, maybe have the Automata self-report if it's byte oriented, Unicode codepoint oriented, or something custom. Then, RunAutomaton could throw an exception if there is a mis-match. However that would be a runtime error; maybe the Automata could be typed.

          Any way, what I'd like to do is do a short term fix that addresses many common cases and the title of this issue. And then do a more thorough fix in a follow-on issue. Ishan Chattopadhyaya do you think this could go into 6.4.2 or are you only looking for "critical" issues? It's debatable what's critical and not. This bug has been around since 5.1 so perhaps it isn't.

          (a patch will follow shortly)

          Show
          dsmiley David Smiley added a comment - Here's my take on it: The UnifiedHighlighter (and PostingsHighlighter from which it derives) processes the MultiTermQueries (e.g. wildcards) in the query and creates multiple CharacterRunAutomaton intended to match the same things. CharacterRunAutomaton takes a Automaton as input, and when it does it's processing, it matches the Character code points (integers from 0 to 0x10FFFF) against the integers in the Automaton. However, this strategy assumes that the Automaton was constructed based on character code points. But AutomatonQuery.getAutomaton is intended to match byte by byte (integers 0 to 255). PrefixQuery.toAutomaton will get 2 bytes for the the "я" in BytesRef form, and add 2 states. This does not line up with the assumptions of CharacterRunAutomaton. A short term immediate "fix" is simply to put AutomatonQuery last in the if-else list as Dmitry indicated. As such, PrefixQuery will work again. This was broken by LUCENE-6367 (Lucene 5.1). TermRangeQuery, which also now extends AutomatonQuery, will likewise work – broken by LUCENE-5879 (Lucene 5.2). Again, back when MultiTermHighlighting was first written, neither of those queries extended AutomatonQuery. But there will be bugs for other types of AutomatonQuery (namely WildcardQuery and RegexpQuery) that have yet to be reported. Robert Muir or Michael McCandless I wonder if you have any thoughts on how to fix this. An idea I have is to not use a CharacterRunAutomaton in the UnifiedHighlighter; use a ByteRunAutomaton instead. Then, add ByteRunAutomaton.run(char[] ...etc) that converts each character to the equivalent UTF8 bytes to match. Even with that, I wonder if this points to areas to improve the automata API so that people don't bump into this trap in the future. For example, maybe have the Automata self-report if it's byte oriented, Unicode codepoint oriented, or something custom. Then, RunAutomaton could throw an exception if there is a mis-match. However that would be a runtime error; maybe the Automata could be typed. Any way, what I'd like to do is do a short term fix that addresses many common cases and the title of this issue. And then do a more thorough fix in a follow-on issue. Ishan Chattopadhyaya do you think this could go into 6.4.2 or are you only looking for "critical" issues? It's debatable what's critical and not. This bug has been around since 5.1 so perhaps it isn't. (a patch will follow shortly)
          Hide
          dsmiley David Smiley added a comment -

          Here's a patch. It fixed the MultiTermHighlighting class in both the postingshighlight package as well as uhighlight. It adds a test method to TestUnifiedHighlighterMTQ. I also beefed up the test for a related method testWhichMTQMatched to avoid potential inadvertent changes to the CharRunAutomata toString that people might depend on. It appears there was no breakage in this case but until I added more query types, wether it did or didn't break wasn't apparent.

          Show
          dsmiley David Smiley added a comment - Here's a patch. It fixed the MultiTermHighlighting class in both the postingshighlight package as well as uhighlight . It adds a test method to TestUnifiedHighlighterMTQ . I also beefed up the test for a related method testWhichMTQMatched to avoid potential inadvertent changes to the CharRunAutomata toString that people might depend on. It appears there was no breakage in this case but until I added more query types, wether it did or didn't break wasn't apparent.
          Hide
          dsmiley David Smiley added a comment -

          IntelliJ IDEA has clued me into this else-if having dead code paths for a long while now here and I'm kicking myself for putting it off – LOL.

          Show
          dsmiley David Smiley added a comment - IntelliJ IDEA has clued me into this else-if having dead code paths for a long while now here and I'm kicking myself for putting it off – LOL.
          Hide
          ichattopadhyaya Ishan Chattopadhyaya added a comment -

          David Smiley, I don't mind including this fix, if you think this is a low risk fix and should be included. Feel free to backport this one to the release branch. I'm anyway waiting for SOLR-10215 as of now.

          Show
          ichattopadhyaya Ishan Chattopadhyaya added a comment - David Smiley , I don't mind including this fix, if you think this is a low risk fix and should be included. Feel free to backport this one to the release branch. I'm anyway waiting for SOLR-10215 as of now.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit ec13032a948a29f69d50d41e4859fd38ed5ca377 in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ec13032 ]

          LUCENE-7717: UnifiedHighlighter and PostingsHighlighter bug in PrefixQuery and TermRangeQuery for multi-byte text

          Show
          jira-bot ASF subversion and git services added a comment - Commit ec13032a948a29f69d50d41e4859fd38ed5ca377 in lucene-solr's branch refs/heads/master from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ec13032 ] LUCENE-7717 : UnifiedHighlighter and PostingsHighlighter bug in PrefixQuery and TermRangeQuery for multi-byte text
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit d9a2c64041067acf4f1d967e13ab7a045502ce1c in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d9a2c64 ]

          LUCENE-7717: UnifiedHighlighter and PostingsHighlighter bug in PrefixQuery and TermRangeQuery for multi-byte text

          (cherry picked from commit ec13032)

          Show
          jira-bot ASF subversion and git services added a comment - Commit d9a2c64041067acf4f1d967e13ab7a045502ce1c in lucene-solr's branch refs/heads/branch_6x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d9a2c64 ] LUCENE-7717 : UnifiedHighlighter and PostingsHighlighter bug in PrefixQuery and TermRangeQuery for multi-byte text (cherry picked from commit ec13032)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 7467c369aaae5c17584360d57a3e6226ac57d817 in lucene-solr's branch refs/heads/branch_6_4 from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7467c36 ]

          LUCENE-7717: UnifiedHighlighter and PostingsHighlighter bug in PrefixQuery and TermRangeQuery for multi-byte text

          (cherry picked from commit d9a2c64)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 7467c369aaae5c17584360d57a3e6226ac57d817 in lucene-solr's branch refs/heads/branch_6_4 from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7467c36 ] LUCENE-7717 : UnifiedHighlighter and PostingsHighlighter bug in PrefixQuery and TermRangeQuery for multi-byte text (cherry picked from commit d9a2c64)
          Hide
          dsmiley David Smiley added a comment -

          Closing. I'll create a linked follow-up bug issue for WildcardQuery (also applies to Regexp) where we can discuss how to deal with that – the more overall fix. I don't think that one should hold up a 6.4.2. It'll likely result in removing the PrefixQuery and TermRangeQuery sections in MultiTermHighlighting.

          Show
          dsmiley David Smiley added a comment - Closing. I'll create a linked follow-up bug issue for WildcardQuery (also applies to Regexp) where we can discuss how to deal with that – the more overall fix. I don't think that one should hold up a 6.4.2. It'll likely result in removing the PrefixQuery and TermRangeQuery sections in MultiTermHighlighting.
          Hide
          dmitrymalinin Dmitry Malinin added a comment -

          Hello Christine
          I saw your test. I think that

          • "I" is not suitable data for test, because StandardAnalyzer converts data to lower case (but PrefixQuery doesn't)
          • why "random"? Сould be better make text array String[] texts = {"i", "я", }

            and run for each?

          • and I mean that result should be, for example, "<b>q</b>" for data "q" and query "q*" and so on for multi-byte data (assertEquals("<b>"text"</b>", snippetString)
            Best regards
          Show
          dmitrymalinin Dmitry Malinin added a comment - Hello Christine I saw your test. I think that "I" is not suitable data for test, because StandardAnalyzer converts data to lower case (but PrefixQuery doesn't) why "random"? Сould be better make text array String[] texts = {"i", "я", } and run for each? and I mean that result should be, for example, "<b>q</b>" for data "q" and query "q*" and so on for multi-byte data (assertEquals("<b>" text "</b>", snippetString) Best regards

            People

            • Assignee:
              dsmiley David Smiley
              Reporter:
              dmitrymalinin Dmitry Malinin
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development