Lucene - Core
LUCENE-6400

SynonymParser should encode 'expand' correctly.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Today SolrSynonymParser encodes something like A, B, C with 'expand=true' like this:
      A -> A, B, C (includeOrig=false)
      B -> B, A, C (includeOrig=false)
      C -> C, A, B (includeOrig=false)

      This produces buggy output (the synonym filter sees it all as replacements, so it marks every term with type SYNONYM, positionLength isn't supported, etc.) and it wastes space in the FST (includeOrig is just one bit).

      Example with "spiderman, spider man" and analysis on 'spider man'

      Trunk:
      term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,type=SYNONYM
      term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,positionLength=1,type=SYNONYM
      term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,type=SYNONYM

      You can see this is confusing: all the words have type SYNONYM, because spider and man were deleted and completely replaced by new terms (which happen to have the same text).

      Patch:
      term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,type=word
      term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,positionLength=2,type=SYNONYM
      term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,type=word
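      The patched output can be read as a token graph: each token starts at the node reached by accumulating positionIncrements and ends positionLength nodes later. A minimal sketch of that reading (plain Java with a hypothetical Token helper, not Lucene classes) shows that with positionLength=2 'spiderman' spans exactly the path taken by 'spider man':

```java
// Minimal model of Lucene's token-graph positions (hypothetical helper,
// not real Lucene APIs): a token starting at node `pos` with
// positionLength `len` ends at node `pos + len`.
class Token {
    final String term;
    final int posInc;  // positionIncrement
    final int posLen;  // positionLength
    Token(String term, int posInc, int posLen) {
        this.term = term;
        this.posInc = posInc;
        this.posLen = posLen;
    }
}

public class SynGraphDemo {
    // Returns true if "spiderman" covers exactly the span of "spider man"
    // in the patched output above.
    static boolean patchedGraphAligns() {
        Token[] stream = {
            new Token("spider", 1, 1),     // type=word
            new Token("spiderman", 0, 2),  // type=SYNONYM, spans two positions
            new Token("man", 1, 1),        // type=word
        };
        int pos = -1;
        int spiderStart = 0, manEnd = 0, synStart = 0, synEnd = 0;
        for (Token t : stream) {
            pos += t.posInc;
            if (t.term.equals("spider")) spiderStart = pos;
            if (t.term.equals("man")) manEnd = pos + t.posLen;
            if (t.term.equals("spiderman")) {
                synStart = pos;
                synEnd = pos + t.posLen;
            }
        }
        return synStart == spiderStart && synEnd == manEnd;
    }

    public static void main(String[] args) {
        System.out.println(patchedGraphAligns());  // prints "true"
    }
}
```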

      1. LUCENE-6400.patch
        11 kB
        Michael McCandless
      2. LUCENE-6400.patch
        10 kB
        Michael McCandless
      3. LUCENE-6400.patch
        10 kB
        Robert Muir
      4. LUCENE-6400.patch
        2 kB
        Robert Muir
      5. PositionLenghtAndType-unittests.patch
        7 kB
        Ian Ribas
      6. unittests-expand-and-parse.patch
        8 kB
        Ian Ribas

        Activity

        Robert Muir added a comment -

        Here is a patch. Just needs tests.

        Michael McCandless added a comment -

        +1

        Ian Ribas added a comment -

        I wrote some unit tests to get into the code, and saw a behavior that I think is not right.

        For the same example of "spiderman, spider man", an analysis on 'spiderman' gives:

        term=spiderman,positionIncrement=1,positionLength=1,type=word
        term=spider,positionIncrement=1,positionLength=1,type=SYNONYM
        term=man,positionIncrement=1,positionLength=1,type=SYNONYM

        To be coherent, I thought it should be:
        term=spiderman,positionIncrement=1,positionLength=2,type=word
        term=spider,positionIncrement=1,positionLength=1,type=SYNONYM
        term=man,positionIncrement=1,positionLength=1,type=SYNONYM

        Things get even more complicated when the synonyms have even more different word counts, such as this example (from the Elasticsearch documentation: http://www.elastic.co/guide/en/elasticsearch/guide/current/multi-word-synonyms.html):

        "usa,united states,u s a,united states of america"

        The analysis of the longest synonym, 'united states of america', works fine, but an analysis of a text containing a shorter one, such as 'the united states is wealthy', still yields a sausage.

        I attached a patch with the changes plus the unit tests that exemplify these situations. The tests now pass, but the results I think are correct are commented just under the ones I think are wrong. To be used if useful, and discarded if not, of course.

        I'm not sure I'll be able to do it, but I'm looking into how to handle positionLength to have a better graph.

        Robert Muir added a comment -

        Thanks Ian. What you see is a limitation of synonymfilter (unrelated to this parser). synonymfilter doesn't "introduce additional positions" except for a trailer at the end as a special case. Otherwise it "sausages" by interleaving phrases together. To change this is much more complicated.

        So your "spiderman" case will not behave correctly, but it's unrelated to my patch here. The parser does the right thing...

        Ian Ribas added a comment -

        Thank you for the explanation. I can see now that the behavior I mentioned is unrelated to the patch, and to the parser, for that matter ...

        Robert Muir added a comment -

        Yeah, that problem is disappointing, but a difficult problem. Definitely one that needs to be fixed. I get the impression from Mike (who is the expert on it), that it requires changes to the tokenstream api so that it can be done safely.

        On the other hand we should look at your tests and try to integrate ones for parsing that show we do the right thing. Maybe we can find or add an assert method that just compares against the SynonymMap directly. Something like assertEntryEquals(String word, boolean includeOrig, String synonyms...) as a start and build from there. It could verify synonyms.length vs count and includeOrig from the header and then the set of strings (empty string means a hole).
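        As a sketch of what such a helper could look like (using a plain in-memory map as a simplified stand-in for the real FST-backed SynonymMap; all names here are hypothetical):

```java
import java.util.*;

// Sketch of the suggested assertEntryEquals() helper. A real test would
// decode the FST-backed SynonymMap; here a plain Map stands in for it so
// only the shape of the assertion is shown (hypothetical, simplified).
public class SynonymAsserts {
    // word -> includeOrig flag, plus word -> ordered synonyms
    // (an empty string entry would mark a hole)
    static final Map<String, Boolean> origFlags = new HashMap<>();
    static final Map<String, List<String>> entries = new HashMap<>();

    static void put(String word, boolean includeOrig, String... synonyms) {
        origFlags.put(word, includeOrig);
        entries.put(word, Arrays.asList(synonyms));
    }

    static void assertEntryEquals(String word, boolean includeOrig, String... synonyms) {
        List<String> actual = entries.get(word);
        if (actual == null)
            throw new AssertionError("no entry for: " + word);
        // Check the includeOrig bit from the (modeled) header first...
        if (origFlags.get(word) != includeOrig)
            throw new AssertionError("includeOrig mismatch for: " + word);
        // ...then the count and the set of output strings.
        if (!actual.equals(Arrays.asList(synonyms)))
            throw new AssertionError("expected " + Arrays.asList(synonyms)
                                     + " but got " + actual);
    }

    public static void main(String[] args) {
        // Rules a parser might produce for "spiderman, spider man" with
        // expand=true once the original token is preserved:
        put("spiderman", true, "spider man");
        put("spider man", true, "spiderman");
        assertEntryEquals("spiderman", true, "spider man");
        assertEntryEquals("spider man", true, "spiderman");
        System.out.println("ok");
    }
}
```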

        Robert Muir added a comment -

        I also tend to think we should fix this parser here (it was just a TODO all along) at least so SYNONYM types are correct for the common case.

        Ian Ribas added a comment -

        Here is a new patch with unit tests.

        I left a simple "spiderman" test based on what is said on the description of the issue, validating the type and positionLength (of the common case).

        I also added the assertEntryEquals() method Robert suggested, which can be used to validate the parsed synonyms directly, and created some simple tests.

        I didn't really understand "holes" on the synonyms, so I didn't implement support for that on the method above.

        Robert Muir added a comment -

        Thank you, that patch is great!

        I folded those in with the change into a new patch. I think it's ready.

        Michael McCandless added a comment -

        Patch looks great!

        I tried to further simplify SolrSynonymParser by breaking out separate for loops for the expand true vs false cases in the non-explicit case ... attached.

        I also put a new TODO.

        Michael McCandless added a comment -

        OK, a bit more simplifying: I don't create outputs in the 2a case, I moved the inputs/outputs declarations down into where they are used, and I just pass true / false for expand since we are already in the if clauses...

        Michael McCandless added a comment -

        Yeah, that problem is disappointing, but a difficult problem. Definitely one that needs to be fixed. I get the impression from Mike (who is the expert on it), that it requires changes to the tokenstream api so that it can be done safely.

        Fixing SynFilter to be able to "make positions" is really important.

        It's somewhat tricky but not impossible, because PosIncAtt + PosLengthAtt are sufficient for expressing any graph and changing any incoming graph to another graph (with enough buffering). I don't think we need changes to TokenStream API, only to the SynFilter impl.

        What makes fixing SynFilter tricky is noting when a new position was created and then fixing any syns that had "spanned" that new position to also increase their position lengths, I think? And it may require more buffering than syn filter does now...
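        The fix-up step described here can be sketched in isolation (hypothetical code, not the SynonymFilter implementation): given each token's start node and positionLength, inserting a new graph position stretches every token whose span crosses it.

```java
import java.util.Arrays;

// Sketch of the fix-up described above (hypothetical, not SynonymFilter
// code): when a new graph position is inserted at node `newPos`, any
// token whose span crosses that node must grow its positionLength by one.
public class PositionFixup {
    static int[] stretchSpans(int[] starts, int[] posLens, int newPos) {
        int[] fixed = posLens.clone();
        for (int i = 0; i < starts.length; i++) {
            // Token i occupies the node range [start, start + posLen);
            // it crosses newPos only if it starts before and ends after.
            if (starts[i] < newPos && newPos < starts[i] + posLens[i]) {
                fixed[i]++;  // the token spanned the inserted position
            }
        }
        return fixed;
    }

    public static void main(String[] args) {
        // "usa" (posLen=2) as a synonym of "united states"; a new position
        // is inserted between "united" and "states", so "usa" must stretch:
        int[] starts  = {0, 0, 1};  // usa, united, states
        int[] posLens = {2, 1, 1};
        System.out.println(Arrays.toString(stretchSpans(starts, posLens, 1)));
        // prints "[3, 1, 1]": only "usa" spanned the new position
    }
}
```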

        Ian Ribas added a comment -

        About the TODO, when not expanding (expand = false), the mappings are created without preserving the original (includeOrig=false), isn't that why the mapping of the first term to itself is needed?

        Ian Ribas added a comment -

        Sorry, I looked at it again and saw that, actually, if that mapping were absent it would be even better. There would be no match, and the term would stay there with type word, not SYNONYM.

        Michael McCandless added a comment -

        isn't that why the mapping of the first term to itself is needed?

        Well, if we simply don't add a rule for inputs[0] -> inputs[0] then it would keep that token?

        But it would be a slight change in behavior because today we remove that token, and put back a type=SYNONYM token....

        So I left it as is now even though it really is quite wasteful...

        Michael McCandless added a comment -

        There would be no match and the term would stay there with type word, and not SYNONYM.

        Right! I think it's more sane that way. But I wonder if anything would care that we no longer label as type=SYNONYM...

        ASF subversion and git services added a comment -

        Commit 1674155 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1674155 ]

        LUCENE-6400: preserve original token when possible in SolrSynonymParser

        ASF subversion and git services added a comment -

        Commit 1674159 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1674159 ]

        LUCENE-6400: preserve original token when possible in SolrSynonymParser

        Michael McCandless added a comment -

        Thanks Ian!

        Anshum Gupta added a comment -

        Bulk close for 5.2.0.


  People

  • Assignee: Unassigned
  • Reporter: Robert Muir
  • Votes: 0
  • Watchers: 4
