Lucene - Core
LUCENE-4518

Suggesters: highlighting (explicit markup of user-typed portions vs. generated portions in a suggestion)

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      As a user, I would like the lookup result of the suggestion engine to contain information which allows me to distinguish the user-entered portion from the autocompleted portion of a suggestion. That information can then be used, e.g., for highlighting.

      Notes:

      It's trivial if the suggestion engine only applies simple prefix search, as then the user-typed prefix is always a true prefix of the completion. However, it's non-trivial as soon as you use an AnalyzingSuggester, where the completion may (in extreme cases) be quite different from the user-provided input. As soon as case folding, diacritics folding, or script adaptation (e.g. kanji/hiragana) comes into play, the completion is no longer guaranteed to be an extension of the query. Since the caller of the suggestion engine (the UI) generally does not know the implementation details, the required information needs to be passed in the LookupResult.
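To illustrate why simple folding only goes so far, here is a minimal sketch of a UI-side highlighter that case-folds and strips combining diacritics from both strings to find the surface prefix matched by the query. `foldedPrefixLength` is a hypothetical helper, not a Lucene API; it handles plain diacritic removal (cafe/café) but fails on language-specific mappings like German ae/ä, which is exactly the gap this issue describes.

```java
import java.text.Normalizer;

public class FoldedPrefix {
    // Fold case and strip combining diacritics (NFD, then remove marks).
    static String fold(String s) {
        String d = Normalizer.normalize(s, Normalizer.Form.NFD);
        return d.replaceAll("\\p{M}", "").toLowerCase();
    }

    // Length of the surface-form prefix whose folded form equals the folded
    // query, or -1 if the folded query is not a prefix of the suggestion.
    static int foldedPrefixLength(String query, String suggestion) {
        String fq = fold(query);
        for (int i = 0; i <= suggestion.length(); i++) {
            if (fold(suggestion.substring(0, i)).equals(fq)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Plain diacritics folding: "cafe" matches the first 4 chars of "café latte".
        System.out.println(foldedPrefixLength("cafe", "café latte")); // 4
        // German "ae" -> "ä" is not diacritic removal, so this fails:
        System.out.println(foldedPrefixLength("praefi", "präfix"));   // -1
    }
}
```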

      Discussion on java-user:

      > I haven't found a simple solution for the highlighting yet,
      > particularly when using AnalyzingSuggester (where it's non-trivial).

      Mike McCandless:

      Ahh I see ... it is challenging in that case. Hmm. Maybe open an issue for this as well, so we can discuss/iterate?

      Attachments

      1. LUCENE-4518.patch (12 kB) Michael McCandless

        Activity

        Michael McCandless added a comment -

        OK thanks for testing Oliver ... sounds like the approach here is a no-go in practice.

        Oliver Christ added a comment -

        I’ve played around with Mike’s patches, but for the AnalyzingSuggester the results have been mixed. Since the transition symbols in the automaton are not closely aligned between the surface and the analyzed form, LookupResult.prefixLength (which attempts to represent the length of the surface string which corresponds to the lookup string) is off quite a bit, leading to very confusing highlighting in non-trivial cases.

        I think this is ultimately due to the way the FST is constructed, but that seems to be non-trivial to change.

        In addition, just returning the (surface) prefix length which corresponds to the lookup string is not sufficient for more complex suggesters, such as “infix suggesters” where the user-provided string is not a prefix of the full surface term (google.com: type in “sox rumor”). What the suggesters ultimately would have to return is a list of text chunks where each chunk has a flag whether it’s based on the lookup string or has been auto-completed.

        So at this point we are back at trying to identify the matched string portions by other means, which isn’t perfect either, but acceptable in most cases.
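Oliver's proposal above, a list of text chunks each flagged as matched or completed, can be sketched in plain Java. The `Chunk` record and the naive substring-based chunking are hypothetical illustrations of the shape such a LookupResult extension could take, not existing Lucene API; the chunking here is the "other means" fallback (plain case-insensitive token matching), shown on the infix "sox rumor" example.

```java
import java.util.ArrayList;
import java.util.List;

public class HighlightChunks {
    // A hypothetical LookupResult extension: the suggestion split into
    // chunks, each flagged as matched (from the lookup string) or completed.
    record Chunk(String text, boolean matched) {}

    // Naive chunking for an infix match: mark every occurrence of each
    // whitespace-separated query token inside the surface form.
    static List<Chunk> chunks(String query, String surface) {
        boolean[] hit = new boolean[surface.length()];
        String lower = surface.toLowerCase();
        for (String tok : query.toLowerCase().split("\\s+")) {
            int from = 0, at;
            while ((at = lower.indexOf(tok, from)) >= 0) {
                for (int i = at; i < at + tok.length(); i++) hit[i] = true;
                from = at + 1;
            }
        }
        // Coalesce runs of equally-flagged characters into chunks.
        List<Chunk> out = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= surface.length(); i++) {
            if (i == surface.length() || hit[i] != hit[start]) {
                out.add(new Chunk(surface.substring(start, i), hit[start]));
                start = i;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "sox rumor" inside "red sox trade rumors": only the typed parts
        // are flagged matched; "red ", " trade ", and the trailing "s" are not.
        System.out.println(chunks("sox rumor", "red sox trade rumors"));
    }
}
```

The point of the chunk list is that it subsumes both the simple prefix case (one matched chunk followed by one completed chunk) and the infix case, so the UI never needs to re-derive the analyzer's folding logic.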

        Michael McCandless added a comment -

        Initial patch, that just records the starting output from when the path was added to the queue, and uses that to set the new prefixLength in LookupResult.

        Currently I only fixed WFST, Analyzing and Fuzzy suggesters to set this. WFST will always be correct (it's trivial), while Analyzing/Fuzzy can sometimes be too long (if there is a common prefix to all possible completions from that starting point).

        Michael McCandless added a comment -

        Hmm... it could be that if we simply record the partial output (surface form) we've accumulated so far, when we add a start path into the TopNSearcher, that this could make a good hilite candidate.

        The FST will always output "eagerly", meaning on seeing a given partial input, it will output as much as is unambiguously possible. So I suspect the equivalent in Lucene of the "praefi" example would just work.

        The only problem I can think of where this won't work is if the completion is [somewhat] deterministic. EG if you only had added "electron" and "electronics" to your suggester, and user has typed only 'e' so far, the output on traversing only 'e' would be electron, which is way too much to hilite. But in a "real" app, where there are tons and tons of suggestions, I suspect this would become a vanishingly minor issue.
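The "electron"/"electronics" pitfall can be shown without an FST: the output an FST emits eagerly while traversing the typed input is the longest common prefix of all suggestions reachable from that point. This is a plain-Java sketch of that computation (not Lucene's FST code), showing how a sparse dictionary over-extends the highlight boundary and a richer one shrinks it back:

```java
import java.util.List;

public class EagerOutput {
    // Longest common prefix of all suggestions matching the typed prefix;
    // this models what an FST outputs "eagerly" while consuming the input.
    static String eagerOutput(String typed, List<String> dict) {
        String lcp = null;
        for (String s : dict) {
            if (!s.startsWith(typed)) continue;
            if (lcp == null) { lcp = s; continue; }
            int i = 0;
            while (i < lcp.length() && i < s.length()
                    && lcp.charAt(i) == s.charAt(i)) i++;
            lcp = lcp.substring(0, i);
        }
        return lcp == null ? "" : lcp;
    }

    public static void main(String[] args) {
        // With only two near-duplicate entries, typing "e" already yields
        // "electron" -- far more than the user typed, a poor hilite boundary.
        System.out.println(eagerOutput("e", List.of("electron", "electronics")));
        // A richer dictionary shrinks the common output toward the typed prefix.
        System.out.println(eagerOutput("e", List.of("electron", "electronics", "energy")));
    }
}
```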

        Michael McCandless added a comment -

        Actually, the start/end offset of each token is probably sufficient here? Because the analyzed form of the partial token the user typed will show the end offset we want, I think?

        Michael McCandless added a comment -

        It'd be quite easy to back trace on each topN path to get the point at
        which it "started" (= the end of where we matched based on the user's
        input).

        The challenge is ... that's the analyzed form, not the surface form;
        in general we need the reverse mapping from analyzed form offsets back
        to surface form offsets ... and the OffsetAttribute gives us that, but
        unfortunately only gives us start/end of each token.

        Another challenge is we convert the analyzed form into a graph
        (TokenStreamToAutomaton), so we'd somehow need to get the surface
        form offsets through there too.

        Oliver Christ added a comment -

        In a classic FST, once you have consumed the input and start looking for completions, you'd know how many input symbols you consumed, how many output symbols you've collected so far, and how many symbols the TopNSearcher appends (i.e. how long, in characters, the completed portion of the string is). That information should be sufficient to explicitly distinguish the two parts. As long as completions don't "surround" the user-entered portions (google: "sox ticket purc"), or the prefix for some reason ends in the "middle" of a UTF-8 byte sequence, this may be sufficient to cover basic use cases and put the length of the covered prefix (or completed suffix) into each LookupResult. I'm assuming that the input and output symbols are "reasonably aligned" in the transition labels, which may not be the case in the current implementation (I haven't gotten to that level of detail yet).

        Simon Willnauer added a comment -

        I tend to agree with Oliver. I have done similar things because the frontend lacks information here. I agree it's non-trivial, but we should provide it if we can. For absolute correctness you likely need to reanalyze and intersect with the automaton :/

        Oliver Christ added a comment -

        I don't see how this can easily be addressed in the UI.

        Go to http://www.google.de and enter "praefi" as the query. The top two completions are

        präfi*x*
        präfi*nal*

        Note that in both cases the prefix "präfi" is recognized although the query is "praefi".

        To handle this in the UI, the UI layer would have to duplicate the Analyzer's logic about case and diacritics folding.

        I agree that in general, this may not be possible at all, but in simpler cases (case folding, diacritics insensitivity) I should think it's feasible (but hard on the UI level).

        Robert Muir added a comment -

        I don't really see highlighting fitting into suggesters. Suggesters are already too complicated: this is a UI problem that can be handled outside of lucene.


          People

          • Assignee: Michael McCandless
          • Reporter: Oliver Christ
          • Votes: 0
          • Watchers: 3
