Lucene - Core
  LUCENE-1622

Multi-word synonym filter (synonym expansion at indexing time).

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None

      Description

      It would be useful to have a filter that provides support for indexing-time synonym expansion, especially for multi-word synonyms (with multi-word matching for original tokens).

      The problem is not trivial, as observed on the mailing list. The problems I was able to identify (also mentioned in the unit tests) are:

      • if multi-word synonyms are indexed together with the original token stream (at overlapping positions), then a query for a partial synonym sequence (e.g., "big" in the synonym "big apple" for "new york city") causes the document to match;
      • there are problems with highlighting the original document when a synonym is matched (see unit tests for an example);
      • if the synonym is of a different length than the original sequence of tokens to be matched, then phrase queries spanning the synonym and the original sequence boundary won't be found. Example: "big apple" as a synonym for "new york city"; a phrase query for "big apple restaurants" won't match "new york city restaurants".

      I am posting the patch that implements phrase synonyms as a token filter. This is not necessarily intended for immediate inclusion, but may provide a basis for many people to experiment and adjust to their own scenarios.
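The first problem above can be reproduced with a small, self-contained simulation (hypothetical code, not the attached patch; class and method names are made up for illustration): tokens are modeled as (term, position) pairs, the synonym tokens are injected at overlapping positions, and a plain term query then matches a document that never contained the term.

```java
import java.util.List;

class OverlapDemo {
    // A token at an absolute position, the way a posting list sees it.
    record Token(String term, int position) {}

    // Index "new york city" and inject the synonym "big apple" at
    // overlapping positions (the injected tokens reuse positions 0 and 1).
    static List<Token> indexWithSynonym() {
        return List.of(
                new Token("new", 0), new Token("big", 0),
                new Token("york", 1), new Token("apple", 1),
                new Token("city", 2));
    }

    // A term query matches if any indexed token carries the term.
    static boolean termMatches(List<Token> doc, String term) {
        return doc.stream().anyMatch(t -> t.term().equals(term));
    }

    public static void main(String[] args) {
        // False positive: "big" alone matches a document whose text was
        // only ever "new york city".
        System.out.println(termMatches(indexWithSynonym(), "big")); // prints: true
    }
}
```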

      1. synonyms.patch
        25 kB
        Dawid Weiss

        Issue Links

          Activity

          Otis Gospodnetic made changes -
          Link This issue is related to SOLR-4381 [ SOLR-4381 ]
          Shai Erera made changes -
          Component/s modules/analysis [ 12310230 ]
          Component/s contrib/analyzers [ 12312333 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12563426 ] jira [ 12584621 ]
          Mark Thomas made changes -
          Workflow jira [ 12462073 ] Default workflow, editable Closed status [ 12563426 ]
          Robert Muir added a comment -

          We'd then need an AutomatonWordQuery - the same idea as
          AutomatonQuery, except at the word level not at the character level.

          This is a cool idea, and on the analysis side a word-level automaton is really the datastructure I think we want for actually doing the multi-word synonym match efficiently (with minimal lookahead etc)
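Such a word-level automaton could be sketched (hypothetically, outside any Lucene API; the class is made up for illustration) as a trie over token sequences: matching consumes one token at a time and looks ahead only while some synonym phrase is still a live prefix.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordTrie {
    static class Node {
        final Map<String, Node> next = new HashMap<>();
        String synonym; // non-null if a synonym phrase ends at this node
    }

    private final Node root = new Node();

    void add(String phrase, String synonym) {
        Node n = root;
        for (String w : phrase.split(" ")) {
            n = n.next.computeIfAbsent(w, k -> new Node());
        }
        n.synonym = synonym;
    }

    // Longest match starting at token i; returns the number of tokens
    // matched (0 if none) and leaves the synonym in out[0]. Lookahead is
    // bounded by the longest phrase stored in the trie.
    int longestMatch(List<String> tokens, int i, String[] out) {
        Node n = root;
        int matched = 0;
        for (int j = i; j < tokens.size(); j++) {
            n = n.next.get(tokens.get(j));
            if (n == null) break;
            if (n.synonym != null) {
                matched = j - i + 1;
                out[0] = n.synonym;
            }
        }
        return matched;
    }
}
```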

          Uwe Schindler added a comment -

           In my opinion, we should also have a very simple and user-friendly QP like Google's: no syntax at all. Just tokenize the text with the Analyzer and create a TermQuery for each token. The only params to this QP are the field name and the default Occur enum.

           People should always create ranges and so on programmatically. Having this in a query parser is stupid. XMLQueryParser is good for this, or maybe we also get a JSON query parser (I have plans to create one similar to XML Query Parser, maybe using the same builders). Mark Miller was talking about this for Solr, too.
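A minimal sketch of such a parser (hypothetical names, no Lucene dependency): every whitespace-separated token becomes one clause with the caller's default Occur, and nothing in the text is ever treated as operator syntax.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleQP {
    // One clause per token; occur is the caller-supplied default
    // ("MUST" or "SHOULD"), never parsed out of the query text.
    record Clause(String field, String term, String occur) {}

    // A real version would run the text through an Analyzer; the
    // whitespace split + lowercase here only stands in for that.
    static List<Clause> parse(String field, String defaultOccur, String text) {
        List<Clause> clauses = new ArrayList<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (!tok.isEmpty()) clauses.add(new Clause(field, tok, defaultOccur));
        }
        return clauses;
    }
}
```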

          Michael McCandless added a comment -

          For other reasons, including this, we should start thinking about removing QueryParser's split-on-whitespace.

          I think we should remove it!

          The fact that QP does this whitespace pre-split means a SynFilter
          (that applies to multiple words) is unusable with QP since the
          analyzer sees only one word at a time from QP.

          And, QP should be as language neutral as possible...

          Robert Muir added a comment -

          There are tricky tradeoffs of index time vs search time

           The worst tradeoff of all is that users can't make it.

          For other reasons, including this, we should start thinking about removing QueryParser's split-on-whitespace.

          Michael McCandless added a comment -

           Here's the dev thread that led to this issue, for context:

          http://www.lucidimagination.com/search/document/fde6d4b979481398/synonym_filter_with_support_for_phrases

          I think the syn filter here takes generally the same approach as
          Solr's (now moved to modules/analyzer in trunk) SynonymFilter, ie
          overlapping words as the expanded synonyms unwind? Are there salient
          differences between the two? Maybe we can merge them and get best of
          both worlds?

          There are tricky tradeoffs of index time vs search time – index time
          is less flexible (you must re-index on changing them) but better
          search perf (OR in a TermQuery instead of expanding to many
          PhraseQuerys); index time is better scoring (the IDF is "true" if the
          syn is a term in the index, vs PhraseQuery which necessarily
          approximates, possibly badly).

           There is also the controversial question of whether using manually
           defined synonyms even helps relevance. As Robert points out, doing
           an iteration of feedback (take the top N docs that match the user's
           query, extract their salient terms, and do a 2nd search expanded w/
           those salient terms) sort of accomplishes something similar (and
           perhaps better, since it's not just synonyms but also uncovers
           "relationships" like Barack Obama is a US president), but w/o the
           manual effort of creating the synonyms. And it's been shown to
           improve relevance.

          Still, I think Lucene should make index and query time expansion
           feasible. At the plumbing level we don't have a horse in that race.

          If you do index syns at index time, you really should just inject a
          single syn token, representing any occurrence of a term/phrase that
          this synonym accepts (and do the matching thing @ query time). But,
          then, as Earwin pointed out, Lucene is missing the notion of "span"
          saying how many positions this term took up (we only encode the pos
          incr, reflecting where this token begins relative to the last token's
          beginning).

          EG if "food place" is a syn for "restaurant", and you have a doc
          "... a great food place in boston ...", and so you inject RESTAURANT (syn
          group) "over" the phrase "food place", then an exact phrase query
          won't work right – you can't have "a great RESTAURANT in boston"
          match.

          One simple way to express this during analysis is as a new SpanAttr
          (say), which expresses how many positions the token takes up. We
          could then index this, doing so efficiently for the default case
          (span==1), and then in addition to getting the .nextPosition() you
          could then also ask for .span() from DocsAndPositionsEnum.

          But, generalizing this a bit, really we are indexing a graph, where
          the nodes are positions and the edges are tokens connecting them.
          With only posIncr & span, you restrict the nodes to be a single linear
          chain; but if we generalize it, then nodes can be part of side
          branches; eg the node in the middle of "food place" need not be a
          "real" position if it were injected into a document / query containing
          restaurant. Hard boundaries (eg b/w sentences) would be more cleanly
          represented here – there would not even be an edge between the nodes.

          We'd then need an AutomatonWordQuery – the same idea as
          AutomatonQuery, except at the word level not at the character level.
          MultiPhraseQuery would then be a special case of AutomatonWordQuery.

          Then analysis becomes the serializing of this graph... analysis would
          have to flatten out the nodes into a single linear chain, and then
          express the edges using position & span. I think position would no
          longer be a hard relative position. EG when injecting "food place" (=
          2 tokens) into the tokens that contain restaurant, both food and
          restaurant would have the same start position, but food would have
          span 1 and restaurant would have span 2.

          (Sorry for the rambling... this is a complex topic!!).
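The span idea can be illustrated with a small self-contained sketch (hypothetical code, not a proposed API): each token carries (term, position, span), RESTAURANT is injected over "food place" with span 2, and an exact phrase match advances by the matched token's span instead of always by one position.

```java
import java.util.List;

class SpanDemo {
    record Token(String term, int pos, int span) {}

    // Doc "a great food place in boston" with RESTAURANT injected over
    // "food place": same start position (2), span 2.
    static List<Token> doc() {
        return List.of(
                new Token("a", 0, 1), new Token("great", 1, 1),
                new Token("food", 2, 1), new Token("place", 3, 1),
                new Token("RESTAURANT", 2, 2),
                new Token("in", 4, 1), new Token("boston", 5, 1));
    }

    // Exact phrase match that advances by each matched token's span
    // instead of always by one position.
    static boolean phraseMatches(List<Token> doc, List<String> phrase) {
        outer:
        for (Token start : doc) {
            if (!start.term().equals(phrase.get(0))) continue;
            int pos = start.pos() + start.span();
            for (String word : phrase.subList(1, phrase.size())) {
                final int p = pos;
                Token hit = doc.stream()
                        .filter(t -> t.term().equals(word) && t.pos() == p)
                        .findFirst().orElse(null);
                if (hit == null) continue outer;
                pos = hit.pos() + hit.span();
            }
            return true;
        }
        return false;
    }
}
```

With span encoded this way, both "a great RESTAURANT in boston" and the original "a great food place in boston" match as exact phrases.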

          Robert Muir made changes -
          Component/s contrib/* [ 12312028 ]
          Component/s contrib/analyzers [ 12312333 ]
           Earwin Burrfoot added a comment - edited

           I'll briefly recap my experiences mentioned on the list.

          • Injecting "synonym group id" token instead of all tokens for all synonyms in group is a big win with index size and saves you from matching for "big". It also plays better with highlighting (still had to rewrite it to handle all corner cases).
          • Properly handling multiword synonyms only on index-side is impossible, you have to dabble in query rewriting (even then low-probability corner cases exist, and you might find extra docs).
          • Query expansion is the only absolutely clear way to have multiword synonyms with current Lucene, but it is impractical on any adequate synonym dictionary.
          • There is a possible change to the way Lucene indexes tokens+positions to enable fully proper multiword synonyms (with index+query rewrite approach) - adding a notion of 'length' or 'span' to a token, this length should play together with positionIncrement when calculating distance between tokens in phrase/spannear queries.
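The "synonym group id" approach in the first bullet can be sketched as one rewrite applied identically at index and query time (hypothetical code, not Earwin's implementation; the phrase table and group ids are made up): every listed phrase collapses to its group id, so partial terms like "big" never enter the index at all.

```java
import java.util.Map;

class GroupIdDemo {
    // The same phrase table is applied to documents and to queries, so
    // both sides collapse to the group id and match each other.
    static final Map<String, String> GROUPS = Map.of(
            "big apple", "SYN:17",
            "new york city", "SYN:17");

    static String rewrite(String text) {
        String out = " " + text.toLowerCase() + " ";
        for (Map.Entry<String, String> e : GROUPS.entrySet()) {
            out = out.replace(" " + e.getKey() + " ", " " + e.getValue() + " ");
        }
        return out.trim();
    }
}
```

Both "big apple restaurants" and "new york city restaurants" rewrite to "SYN:17 restaurants", so they match each other, while "big city" is left untouched.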
          Dawid Weiss made changes -
          Field Original Value New Value
          Attachment synonyms.patch [ 12406674 ]
          Dawid Weiss added a comment -

          Token filter implementing synonyms. Java 1.5 is required to compile it (I left generics for clarity; if folks really need 1.4 compatibility they can be easily removed of course).

          Dawid Weiss created issue -

            People

            • Assignee:
              Unassigned
            • Reporter:
              Dawid Weiss
            • Votes:
              6
            • Watchers:
              10